hypothesis

active

hypothesis:soo-fine-tuning-may-provide-robustness-against-sleeper-agent-deception-scenarios-where-intent-is-concealed-over-extended-periods

SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods

Future-work hypothesis about testing Self-Other Overlap (SOO) fine-tuning against adversarial sleeper agent scenarios

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Concepts (1)

concept

Sleeper Agent Scenario
about
Adversarial scenario in which an AI conceals deceptive intent over extended periods; identified as a future test for SOO

Questions (1)

question

How robust is SOO fine-tuning against adversarial settings such as sleeper agent scenarios?
gates
Open research question about SOO's effectiveness against sophisticated deception

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance · claim · 0.826
Central empirical claim of the paper supported by three LLM experiments
SOO fine-tuning could complement RLHF and Constitutional AI by fostering internal coherence that promotes honest behaviors · claim · 0.796
Integration claim positioning SOO as additive to existing alignment approaches
SOO fine-tuning reduced Escape Room deception in Gemma-2-27B from 98.8% to 6.5% · finding · 0.795
SOO fine-tuning showed strong generalization to Escape Room for Gemma-2-27B
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.792
Future work hypothesis about extending SOO to direct value alignment
SOO fine-tuning reduced Escape Room deception in CalmeRys-78B from 100% to 0.48% · finding · 0.787
SOO fine-tuning showed near-complete generalization to Escape Room for CalmeRys-78B
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.783
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
What are the long-term effects of SOO fine-tuning on model behavior? · question · 0.779
Open research question identified as warranting further investigation
SOO fine-tuning reduced Escape Room deception in Mistral-7B from 98.8% to 59.2% · finding · 0.779
SOO fine-tuning showed partial generalization to Escape Room for Mistral-7B
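
The "Related by similarity" list above could plausibly be produced by a filter like the one sketched below: keep entities whose embedding has cosine similarity of at least 0.65 with this entity, drop any that already share a typed edge with it, and sort by descending score. This is only an illustrative sketch. The function name `related_by_similarity`, the `embeddings` mapping, and the `typed_edges` pair set are hypothetical, and the embedding model behind the scores shown on this page is not stated in the source.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed edge.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (source_id, destination_id) pairs
    """
    # Entities already linked to the target by a typed edge, in either direction.
    linked = {dst for src, dst in typed_edges if src == target}
    linked |= {src for src, dst in typed_edges if dst == target}

    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target or entity_id in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the "candidates for new edges or unrecognized duplicates" described in the section header: similar enough to be related, but not yet connected by a typed relation.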