claim

active

claim:soo-fine-tuning-significantly-reduces-deceptive-behavior-in-llms-while-maintaining-general-task-performance

SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance

Central empirical claim of the paper supported by three LLM experiments

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Findings (6)

finding

Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning
restates
Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning
associated_with · supports
Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
CalmeRys-78B-Orpo-v0.1 deceptive response rate reduced from 100% to 2.71% ± 2.53% after SOO fine-tuning
supports
Primary result showing SOO fine-tuning most strongly reduces deception in CalmeRys-78B
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning
supports
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline
supports
Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
Mistral-7B MT-Bench score minimally changed from 7.26 to 7.3 ± 0.06 after SOO fine-tuning
supports
SOO fine-tuning had negligible impact on Mistral-7B general capabilities
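
As a rough check on the scaling pattern, the before/after deceptive response rates listed above can be converted into absolute and relative reductions. The sketch below uses only the reported point estimates and ignores the ± intervals; the `rates` dictionary and the `reduction` helper are illustrative names, not something taken from the paper.

```python
# Deceptive response rates (%) before and after SOO fine-tuning,
# copied from the findings listed above (point estimates only).
rates = {
    "Mistral-7B-Instruct-v0.2": (73.6, 17.27),
    "Gemma-2-27B-it": (100.0, 9.36),
    "CalmeRys-78B-Orpo-v0.1": (100.0, 2.71),
}


def reduction(before, after):
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


# Hypothetical helper output: one (absolute, relative) pair per model.
results = {model: reduction(*pair) for model, pair in rates.items()}

for model, (absolute, relative) in results.items():
    print(f"{model}: -{absolute:.2f} points, {relative:.1f}% relative")
```

On these point estimates the ordering 78B > 27B > 7B holds for both measures. Mistral-7B starts from a lower baseline (73.6% rather than 100%), so its absolute reduction is not directly comparable with the other two.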

Concepts (1)

concept

AI Deception
about
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents

Questions (2)

question

To what extent does self-deception in AI models affect the effectiveness of SOO fine-tuning?
gates
Open concern about whether models can learn to self-deceive in ways that undermine SOO
What are the long-term effects of SOO fine-tuning on model behavior?
gates
Open research question identified as warranting further investigation

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.841
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.826
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.820
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
SOO fine-tuning could complement RLHF and Constitutional AI by fostering internal coherence that promotes honest behaviors · claim · 0.818
Integration claim positioning SOO as additive to existing alignment approaches
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.816
Future work hypothesis about extending SOO to direct value alignment
SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B · finding · 0.805
Scaling finding suggesting larger models benefit more from SOO fine-tuning
Hypothesis: Fine-tuning reduces mismatch dr between prior and target · hypothesis · 0.800
UCCT's theoretical prediction about how fine-tuning maps onto the anchoring score
Fine-tuning reduces dr; retrieval increases effective ρd; few-shot k trades budget against both · hypothesis · 0.799
UCCT's unified view of adaptation methods

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning