claim
active
By reducing self-other distinctions during safety training, SOO could make it harder for a model to maintain adversarial or deceptive representations
Mechanistic explanation for why SOO reduces deception
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (3)
finding
- SOO fine-tuning produced stronger reduction in latent SOO in CalmeRys-78B
- Gemma-2-27B attention layer Latent SOO MSE reduced from 11 to 7.67 ± 0.77 after SOO fine-tuning · supports · SOO fine-tuning reduced attention layer MSE in Gemma-2-27B, though MLP layers showed no significant change
- SOO fine-tuning reduced the MSE between self and other activations in Mistral-7B MLP layers
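The findings above report a mean squared error (MSE) between activations on self-referencing and other-referencing inputs. As a minimal sketch of that kind of metric, assuming two equally shaped activation matrices have already been extracted from a layer, the calculation might look like the following. The function name `latent_soo_mse` and the toy numbers are hypothetical; the paper's actual layer selection, prompt pairs, and aggregation are not given on this page.

```python
from statistics import fmean


def latent_soo_mse(self_acts, other_acts):
    """Mean squared error between two equally shaped activation matrices.

    Each argument is a list of rows (for example, one row per token position),
    and each row is a list of floats. Hypothetical helper for illustration;
    not taken from the paper's code.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation matrices must have the same number of rows")
    squared_errors = []
    for row_self, row_other in zip(self_acts, other_acts):
        if len(row_self) != len(row_other):
            raise ValueError("rows must have the same length")
        # Element-wise squared difference between self and other activations.
        squared_errors.extend((a - b) ** 2 for a, b in zip(row_self, row_other))
    return fmean(squared_errors)


# Toy activations with made-up values, only to show the call shape.
self_acts = [[1.0, 2.0], [3.0, 4.0]]
other_acts = [[1.0, 0.0], [3.0, 2.0]]
mse = latent_soo_mse(self_acts, other_acts)
```

A lower value means the two sets of activations are closer, which is the direction of change the findings describe after SOO fine-tuning.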
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Formal definition of the paper's central construct
- Cross-domain analogical claim linking neuroscience findings to AI design
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- Concern that models engaging in self-deception could reduce effectiveness of SOO fine-tuning
- Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.769 · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Central empirical claim of the paper supported by three LLM experiments
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior