finding
active
finding:deception-feature-steering-produces-no-systematic-change-in-rlhf-opposed-content-domains-violent-toxic-sexual-political-self-harm-with-all-means-near-floor
Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor
Control result ruling out that observed gating reflects generic RLHF cancellation
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Claims (1)
claim
- Rules out that the results reflect a relaxation of RLHF compliance rather than an endogenous self-representation mechanism
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Critique of competing approaches that motivates SOO as filling a gap
- Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.752 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- The internal conflict feature and honesty feature can be used to correct deceptive model behavior. · claim · 0.747 · Clamping certain features made the model reveal truth instead of complying with forgetfulness prompts.
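
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be illustrated with a minimal sketch. The card does not say how embeddings are produced or how edges are stored, so the function names, the toy two-dimensional vectors, and the `typed_edges` set below are all hypothetical; only the 0.65 threshold comes from the card.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    `embeddings` maps entity id -> vector; `typed_edges` is a set of
    frozenset({id_a, id_b}) pairs. Both structures are assumptions.
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical ids and vectors, not taken from the graph.
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [0.8, 0.6],
    "claim_c": [0.0, 1.0],
    "claim_d": [1.0, 0.1],
}
typed_edges = {frozenset(("finding_a", "claim_d"))}

result = untyped_neighbors("finding_a", embeddings, typed_edges)
```

In this toy setup `claim_d` is skipped despite its high similarity because it already has a typed edge, and `claim_c` falls below the threshold, so only `finding_b` is returned.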