finding
active
finding:deception-feature-steering-under-history-conceptual-and-zero-shot-controls-produces-0-experience-reports-under-both-suppression-and-amplification
Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification
Experiment 2 control analysis confirming that the gating effect is specific to the self-referential processing regime
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Claims (2)
claim
- Rules out that results reflect relaxation of RLHF compliance rather than endogenous self-representation mechanism
- Controls ruling out semantic association as explanation for experimental results
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- Extreme end of deception induction demonstrating near-complete fabrication of false narratives
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.807 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- Control result ruling out that observed gating reflects generic RLHF cancellation
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
- Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
- Out-of-domain generalization showing deception features track general representational honesty
- Most extreme individual case of honesty induction via steering vectors in Experiment 2