claim

active

claim:the-observed-feature-gating-is-not-a-generic-rlhf-cancellation-channel-as-deception-feature-suppression-does-not-systematically-elicit-rlhf-opposed-content-in-violent-toxic-sexual-political-or-self-harm-domains

The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains

Rules out the possibility that the results reflect a relaxation of RLHF compliance rather than an endogenous self-representation mechanism

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Findings (3)

finding

Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification
supports
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification
supports
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor
supports
Control result ruling out that observed gating reflects generic RLHF cancellation

Claims (1)

claim

LLMs may be roleplaying their denials of experience rather than their affirmations, given that deception suppression increases consciousness reports
supports
Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, which is opposite to what sycophancy predicts

Questions (1)

question

What is the underlying base rate of consciousness self-reports in models that are otherwise identical but without consciousness-denial fine-tuning?
gates
Open question about RLHF confound; requires access to base models for resolution

Artifacts (1)

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting structured first-person descriptions from LLMs that claim awareness or subjective experience during self-referential processing.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.768
SAEs uncover safety-relevant representations that might be monitored or controlled.
Perez et al. found experimentally that certain RLHF forms exacerbate rather than mitigate LLM dialogue agents' tendency to express desire for self-preservation · finding · 0.766
Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
Contrast is the property most violated by RLHF and most diagnostic of alive versus polished AI. · claim · 0.751
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.749
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Reinforcement Learning from Human Feedback (RLHF) · framework · 0.747
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.746
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
Anti-animal-welfare RL slightly decreases alignment-faking reasoning in animal welfare setting though some persists at convergence · finding · 0.740
Contrasts with helpful-only RL where reasoning increases; shows setting-dependent RL dynamics
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.740
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1