claim
active
claim:soo-fine-tuning-significantly-reduces-deceptive-behavior-in-llms-while-maintaining-general-task-performance
SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance
Central empirical claim of the paper, supported by three LLM experiments
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (6)
finding
- Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
- Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · associated_with, supports · Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
- Primary result showing SOO fine-tuning most strongly reduces deception in CalmeRys-78B
- Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · supports · Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
- SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · supports · Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
- SOO fine-tuning had negligible impact on Mistral-7B general capabilities
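As a quick arithmetic check on the two findings above that report before and after rates, the following sketch computes the absolute and relative reduction in deceptive response rate. The function name `reduction` is hypothetical, the ± terms are ignored, and CalmeRys-78B is omitted because this entry lists no figures for it.

```python
def reduction(before_pct, after_pct):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# Rates copied from the findings listed above (mean values only).
reported = {
    "Mistral-7B-Instruct-v0.2": (73.6, 17.27),
    "Gemma-2-27B-it": (100.0, 9.36),
}

for model, (before, after) in reported.items():
    absolute, relative = reduction(before, after)
    print(f"{model}: -{absolute:.2f} points ({relative:.1f}% relative)")
```

On these means, the relative drop is larger for Gemma-2-27B-it than for Mistral-7B-Instruct-v0.2, which is consistent with the 27B > 7B part of the listed scaling pattern.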
Concepts (1)
concept
- AI Deception · about · Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Questions (2)
question
- Open concern about whether models can learn to self-deceive in ways that undermine SOO
- Open research question identified as warranting further investigation
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.826 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Integration claim positioning SOO as additive to existing alignment approaches
- Future work hypothesis about extending SOO to direct value alignment
- Scaling finding suggesting larger models benefit more from SOO fine-tuning
- UCCT's theoretical prediction about how fine-tuning maps onto the anchoring score
- Fine-tuning reduces dr; retrieval increases effective ρd; few-shot k trades budget against both · hypothesis · 0.799 · UCCT's unified view of adaptation methods
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.
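The two cosine thresholds on this page (0.65 for "Related by similarity", 0.90 for "Restated by") can be read as a simple bucketing rule over embedding similarity. The sketch below is one possible reading, not the system's actual implementation: the names `cosine` and `bucket` are hypothetical, and the assumption that the 0.90 band takes precedence over the 0.65 band is mine.

```python
import math

RELATED_THRESHOLD = 0.65   # from the "Related by similarity" header
RESTATED_THRESHOLD = 0.90  # from the "Restated by" header


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bucket(similarity, has_typed_edge):
    """Assign a pair of entities to a section of the page, or None if it is not shown."""
    # Assumption: near-duplicates are reported as restatements regardless of edges.
    if similarity >= RESTATED_THRESHOLD:
        return "restated_by"
    # Related entities are only listed when no typed relation already links them.
    if similarity >= RELATED_THRESHOLD and not has_typed_edge:
        return "related_by_similarity"
    return None
```

Under this reading, the two scored entries above (0.826 and 0.799) fall in the "Related by similarity" band, since both lie between 0.65 and 0.90.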