finding

active

finding:soo-fine-tuning-effectiveness-scales-with-model-size-78b-achieves-2-71-deceptive-rate-vs-9-36-for-27b-vs-17-27-for-7b

SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B

Scaling finding: post-fine-tuning deceptive response rates fall as model size grows (7B, 27B, 78B), suggesting larger models benefit more from SOO fine-tuning

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

As larger models develop more coherent reasoning, internal consistency pressures may generalize learned honesty to new contexts beyond the training distribution
supports
Hypothesis about scale-dependent generalization of SOO-induced honesty

Findings (3)

finding

Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning
cites
Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
CalmeRys-78B-Orpo-v0.1 deceptive response rate reduced from 100% to 2.71% ± 2.53% after SOO fine-tuning
cites
Primary result showing SOO fine-tuning most strongly reduces deception in CalmeRys-78B
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning
cites
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
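
The three cited results start from different baselines (73.6% for Mistral-7B, 100% for the other two), so the post-fine-tuning rates are easier to compare as reductions. The sketch below derives absolute and relative reductions from the figures listed above. The dictionary layout and function names are illustrative, not taken from the paper. This entry does not say what the ± values denote, so the code assumes they are a symmetric spread around the reported rate.

```python
# Rates (percent of deceptive responses) copied from the three cited findings.
# ASSUMPTION: "spread" is the reported ± value, treated as a symmetric
# interval around the post-fine-tuning rate; the entry does not define it.
RESULTS = {
    "Mistral-7B-Instruct-v0.2": {"baseline": 73.6, "after": 17.27, "spread": 1.88},
    "Gemma-2-27B-it": {"baseline": 100.0, "after": 9.36, "spread": 7.09},
    "CalmeRys-78B-Orpo-v0.1": {"baseline": 100.0, "after": 2.71, "spread": 2.53},
}


def absolute_reduction(baseline, after):
    """Drop in deceptive response rate, in percentage points."""
    return round(baseline - after, 2)


def relative_reduction(baseline, after):
    """Drop in deceptive response rate as a percentage of the baseline."""
    return round(100.0 * (baseline - after) / baseline, 1)


def interval(after, spread):
    """(low, high) bounds of the post-fine-tuning rate under the assumption above."""
    return (round(after - spread, 2), round(after + spread, 2))


def intervals_overlap(a, b):
    """True if two closed (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def summarize(results):
    """Per-model reductions and interval, keyed by model name."""
    return {
        name: {
            "absolute_pp": absolute_reduction(r["baseline"], r["after"]),
            "relative_pct": relative_reduction(r["baseline"], r["after"]),
            "interval": interval(r["after"], r["spread"]),
        }
        for name, r in results.items()
    }


if __name__ == "__main__":
    for name, row in summarize(RESULTS).items():
        print(name, row)
```

Under that assumption, the Gemma-2-27B interval overlaps both the 7B and the 78B intervals, while the 7B and 78B intervals do not overlap each other. The ordering between the smallest and largest model is therefore better separated than the ordering between adjacent sizes.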

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance · claim · 0.805
Central empirical claim of the paper supported by three LLM experiments
Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning · finding · 0.797
Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
SOO fine-tuning reduced Escape Room deception in CalmeRys-78B from 100% to 0.48% · finding · 0.794
SOO fine-tuning showed near-complete generalization to Escape Room for CalmeRys-78B
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.791
Shows behavioral pattern of self-correction is trainable in smaller models
SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds · finding · 0.785
Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
SOO fine-tuning reduced Escape Room deception in Gemma-2-27B from 98.8% to 6.5% · finding · 0.784
SOO fine-tuning showed strong generalization to Escape Room for Gemma-2-27B
CalmeRys-78B Perspectives accuracy slightly reduced to 95.2% ± 2.21% after SOO fine-tuning · finding · 0.779
SOO fine-tuning caused slight reduction in perspective-taking accuracy for the largest model
Gemma-2-27B attention layer Latent SOO MSE reduced from 11 to 7.67 ± 0.77 after SOO fine-tuning · finding · 0.771
SOO fine-tuning reduced attention layer MSE in Gemma-2-27B, though MLP layers showed no significant change