finding

active

finding:larger-llms-show-greater-reduction-in-deceptive-behavior-after-soo-fine-tuning

Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning

Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (2)

claim

SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance
restates
Central empirical claim of the paper supported by three LLM experiments
As larger models develop more coherent reasoning, internal consistency pressures may generalize learned honesty to new contexts beyond the training distribution
supports
Hypothesis about scale-dependent generalization of SOO-induced honesty

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Simply prompting LLMs to be honest does not reduce their deceptive behavior · claim · 0.801
Contrastive claim showing fine-tuning is necessary, not just instruction prompting
SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B · finding · 0.797
Scaling finding suggesting larger models benefit more from SOO fine-tuning
Li et al. 2024: larger LLMs outperform smaller ones at distinguishing self-related from non-self-related properties on self-awareness benchmarks · finding · 0.793
Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.788
Shows behavioral pattern of self-correction is trainable in smaller models
What are the long-term effects of SOO fine-tuning on model behavior? · question · 0.787
Open research question identified as warranting further investigation
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.786
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF · concept · 0.785
Research gap: RL experiments have capability term but LLM experiments do not yet incorporate one
Fine-tuning reduces dr; retrieval increases effective ρd; few-shot k trades budget against both · hypothesis · 0.783
UCCT's unified view of adaptation methods
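
The selection rule for this list (cosine ≥ 0.65 and no typed edge to the current entity) can be sketched roughly as follows. This is an illustrative sketch only: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are assumptions made for the example, not the actual implementation behind this page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already share a typed relation (e.g. restates, supports)
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, raising `threshold` narrows the result toward near-duplicates, while the default keeps looser neighbors that may deserve a new typed edge.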

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance