claim

active

claim:by-reducing-self-other-distinctions-during-safety-training-soo-could-make-it-harder-for-a-model-to-maintain-adversarial-or-deceptive-representations

By reducing self-other distinctions during safety training, SOO could make it harder for a model to maintain adversarial or deceptive representations

Mechanistic explanation for why SOO reduces deception

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Findings (3)

finding

CalmeRys-78B Latent SOO MSE reduced from 0.593 to 0.315 ± 0.017 after SOO fine-tuning
supports
SOO fine-tuning produced a stronger reduction in Latent SOO MSE in CalmeRys-78B
Gemma-2-27B attention layer Latent SOO MSE reduced from 11 to 7.67 ± 0.77 after SOO fine-tuning
supports
SOO fine-tuning reduced attention-layer MSE in Gemma-2-27B, though MLP layers showed no significant change
Mistral-7B Latent SOO MSE reduced from 0.107 to 0.078 ± 0.001 after SOO fine-tuning
supports
SOO fine-tuning reduced the MSE between self and other activations in Mistral-7B MLP layers
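
The findings above report Latent SOO as a mean squared error (MSE) between activations recorded on self-referencing and other-referencing inputs, so a lower value means more overlap. A minimal sketch of that metric, assuming the activations have already been extracted as equal-length lists of floats; the function name `latent_soo_mse` and the toy vectors are hypothetical and not taken from the paper:

```python
def latent_soo_mse(self_acts, other_acts):
    """Mean squared error between two equal-length activation vectors.

    Hypothetical helper for illustration only: lower MSE is read here as
    higher self-other overlap.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if not self_acts:
        raise ValueError("activation vectors must not be empty")
    squared_diffs = [(s - o) ** 2 for s, o in zip(self_acts, other_acts)]
    return sum(squared_diffs) / len(squared_diffs)


# Toy activations (made up): identical vectors give zero MSE,
# and the MSE grows as the two vectors diverge.
identical = latent_soo_mse([1.0, 2.0], [1.0, 2.0])
divergent = latent_soo_mse([1.0, 2.0], [1.0, 4.0])
```

How the paper selects layers, prompts, and aggregation across tokens is not specified on this page, so those choices are left out of the sketch.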

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts. · quote · 0.789
Formal definition of the paper's central construct
Neural self-other overlap in humans mediates empathy and inversely predicts deceptive behavior, motivating the SOO approach for AI · claim · 0.784
Cross-domain analogical claim linking neuroscience findings to AI design
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.783
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
Self-Deception Risk in SOO · concept · 0.779
Concern that models engaging in self-deception could reduce effectiveness of SOO fine-tuning
Behavioral evidence from closed-weight models cannot definitively rule out that self-reports reflect training artifacts or sophisticated simulation rather than genuine self-awareness · claim · 0.772
Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.769
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance · claim · 0.762
Central empirical claim of the paper supported by three LLM experiments
Self-Other Overlap (SOO) Fine-Tuning · framework · 0.759
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior