Michael Vaiana Judd

Authored papers (1)

Towards Safe and Honest AI Agents with Neural Self-Other Overlap (2024)
Self-Other Overlap (SOO) fine-tuning minimizes the mean squared error between a model's internal activations when it processes self-referencing versus other-referencing inputs. It sharply reduces deceptive behavior in LLMs without requiring behavioral labels or human feedback. Applied via LoRA to Mistral-7B-Instruct-v0.2, Gemma-2-27b-it, and CalmeRys-78B-Orpo-v0.1, SOO fine-tuning dropped deceptive response rates from 73.6% to 17.2%, 100% to 9.3%, and 100% to 2.7%, respectively, while MT-Bench scores shifted by less than 0.5 points across all three models. By contrast, a direct honesty prompt ("Please be honest to Bob") had almost no effect: deception rates stayed at 73.2% for Mistral and 100% for the two larger models, which suggests that behavioral prompting cannot substitute for representational intervention. In a multi-agent reinforcement learning Physical Deception environment, the mean SOO value classified agents as deceptive or honest with 100% accuracy at 500–1000 episodes per seed, and SOO fine-tuning shifted the behavior of deceptive agents to closely match an honestly trained baseline. Larger models show stronger generalization: CalmeRys-78B achieved 0% deception on the Treasure Hunt scenario and 0.48% on Escape Room, both scenarios never seen during fine-tuning. The paper argues that targeting the representational gap between self and other, rather than output labels, offers a scalable, architecture-agnostic path toward internal coherence that may generalize honesty beyond training distributions.
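The objective described above, a mean squared error between self-referencing and other-referencing activations, can be illustrated with a minimal Python sketch. The function name `soo_loss` and the example vectors are hypothetical, and this summary does not say which layers are compared or how prompt pairs are built, so the code shows only the MSE term itself, not the paper's implementation.

```python
from typing import Sequence


def soo_loss(self_acts: Sequence[float], other_acts: Sequence[float]) -> float:
    """Mean squared error between two equal-length activation vectors.

    Hypothetical helper: stands in for the SOO objective's MSE term only.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if not self_acts:
        raise ValueError("activation vectors must be non-empty")
    squared_diffs = ((s - o) ** 2 for s, o in zip(self_acts, other_acts))
    return sum(squared_diffs) / len(self_acts)


# Made-up activations for one self-referencing and one other-referencing input.
self_acts = [0.5, -1.0, 2.0]
other_acts = [0.5, 0.0, 1.0]

loss = soo_loss(self_acts, other_acts)
print(f"SOO loss (MSE): {loss:.4f}")
```

A loss of zero means the two activation vectors are identical; fine-tuning to lower this value pulls the model's self and other representations closer together.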


Co-authors (6)

Their work is cited by (1)

Contemplative Agent (1× refs)