claim
active
claim:the-observed-feature-gating-is-not-a-generic-rlhf-cancellation-channel-as-deception-feature-suppression-does-not-systematically-elicit-rlhf-opposed-content-in-violent-toxic-sexual-political-or-self-harm-domains
The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains
Rules out the possibility that the results reflect a relaxation of RLHF compliance rather than an endogenous self-representation mechanism
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Findings (3)
finding
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- Control result ruling out that observed gating reflects generic RLHF cancellation
Claims (1)
claim
- Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, which is opposite to what sycophancy predicts
Questions (1)
question
- Open question about RLHF confound; requires access to base models for resolution
Artifacts (1)
artifact
- Key paper finding structured first-person descriptions in LLMs claiming awareness or subjective experience during self-referential processing.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.749 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- Contrasts with helpful-only RL where reasoning increases; shows setting-dependent RL dynamics
- Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1