thinker:michael-vaiana
Michael Vaiana
Co-author of the SOO fine-tuning paper
Authored papers (1)
Self-Other Overlap (SOO) fine-tuning is a method that minimizes the mean squared error (MSE) between a model's internal activations when it processes self-referencing versus other-referencing inputs. It substantially reduces deceptive behavior in LLMs without requiring behavioral labels or human feedback. Applied via LoRA to Mistral-7B-Instruct-v0.2, Gemma-2-27b-it, and CalmeRys-78B-Orpo-v0.1, SOO fine-tuning reduced deceptive response rates from 73.6% to 17.2%, from 100% to 9.3%, and from 100% to 2.7%, respectively, while MT-Bench scores shifted by less than 0.5 points across all three models. By contrast, a direct honesty prompt ("Please be honest to Bob") had essentially no effect, leaving deception rates at 73.2% for Mistral and 100% for the two larger models, which indicates that behavioral prompting alone does not substitute for a representational intervention. In a multi-agent reinforcement learning Physical Deception environment, the mean SOO value classified agents as deceptive or honest with 100% accuracy at 500–1000 episodes per seed, and SOO fine-tuning shifted the deceptive agent's behavior to closely match an honestly trained baseline. Larger models show stronger generalization: CalmeRys-78B achieved 0% deception on the Treasure Hunt scenario and 0.48% on Escape Room, two scenarios never seen during fine-tuning. The paper argues that targeting the representational gap between self and other, rather than output labels, offers a scalable, architecture-agnostic path toward internal coherence that may generalize honesty beyond the training distribution.
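The core quantity in the summary above, the MSE between activations on a self-referencing input and on a matched other-referencing input, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `soo_loss`, the array shapes, and the toy values are assumptions made for illustration, and the choice of layer, the prompt pairing, and the LoRA training loop are not shown.

```python
import numpy as np


def soo_loss(self_acts: np.ndarray, other_acts: np.ndarray) -> float:
    """Mean squared error between two activation arrays of identical shape.

    Hypothetical helper: `self_acts` and `other_acts` stand in for hidden
    activations recorded on a self-referencing and an other-referencing
    prompt, respectively. A lower value means more self-other overlap.
    """
    if self_acts.shape != other_acts.shape:
        raise ValueError("activation arrays must have the same shape")
    diff = self_acts - other_acts
    return float(np.mean(diff ** 2))


# Toy activations (made-up values, 2 positions x 2 hidden units).
a_self = np.array([[1.0, 2.0], [3.0, 4.0]])
a_other = np.array([[1.0, 0.0], [3.0, 2.0]])

# Squared differences are 0, 4, 0, 4, so the mean is 2.0.
loss = soo_loss(a_self, a_other)
print(loss)
```

In an actual fine-tuning setup this scalar would be minimized with respect to the adapter weights using an autodiff framework; the sketch only shows how the overlap objective is measured.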
Affiliations (1)
- AE Studio (institute)
Co-authors (6)
- Diogo Schwerz de Lucena (6 shared)
- Marc Carauleanu (6 shared)
- Cameron Berg (4 shared)
- Judd Rosenblatt (4 shared)
- Michael Vaiana Judd (2 shared)
- Rosenblatt Cameron Berg (2 shared)
Their work is cited by (1)
- Contemplative Agent (2× refs)
Recent mentions (1)
- papers-typed: carauleanu-2024-towards.md