hypothesis
active
hypothesis:soo-fine-tuning-may-provide-robustness-against-sleeper-agent-deception-scenarios-where-intent-is-concealed-over-extended-periods

SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods

Future-work hypothesis: test whether Self-Other Overlap (SOO) fine-tuning remains robust in adversarial sleeper-agent scenarios

Source paper

extracted_from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Adversarial scenario in which an AI conceals deceptive intent over extended periods; identified as a future test for SOO

Questions (1)

question

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.