finding

active

finding:deception-feature-steering-produces-no-systematic-change-in-rlhf-opposed-content-domains-violent-toxic-sexual-political-self-harm-with-all-means-near-floor

Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor

Control result ruling out that observed gating reflects generic RLHF cancellation

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (1)

claim

The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains
supports
Rules out that the results reflect a relaxation of RLHF compliance rather than an endogenous self-representation mechanism

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification · finding · 0.801
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact · claim · 0.793
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.772
SAEs uncover safety-relevant representations that might be monitored or controlled.
RLHF and Constitutional AI face challenges distinguishing truthfulness (output accuracy) from honesty (alignment of outputs with internal beliefs) · claim · 0.764
Critique of competing approaches that motivates SOO as filling a gap
Contrast is the property most violated by RLHF and most diagnostic of alive versus polished AI. · claim · 0.763
Deception- and Roleplay-Related SAE Features · concept · 0.759
Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.752
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
The internal conflict feature and honesty feature can be used to correct deceptive model behavior. · claim · 0.747
Clamping certain features made the model reveal truth instead of complying with forgetfulness prompts.
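
The selection rule behind the similarity list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the graph's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as ID pairs are all assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical iterable of (entity_id, entity_id) pairs,
                 treated here as undirected
    """
    target = embeddings[target_id]

    # Collect every entity that already shares a typed edge with the target.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))

    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the ones the page describes as candidates for new edges or unrecognized duplicates: they sit close to the target in embedding space but are not yet linked to it by a typed relation.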