claim

active

claim:deception-related-sae-features-track-a-domain-general-representational-honesty-axis-rather-than-a-consciousness-specific-roleplay-artifact

Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact

Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Findings (2)

finding

Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories
supports
Out-of-domain generalization showing deception features track general representational honesty
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories
supports
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
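
As a rough check on how lopsided the 28-of-29 split is, an exact one-sided sign test can be worked out directly. This is a minimal sketch and not an analysis reported in the source paper: it assumes the 29 categories are independent and that, under a null of no directional effect, each category is equally likely to favor suppression or amplification. The function name `sign_test_p` is illustrative.

```python
from math import comb


def sign_test_p(successes: int, n: int) -> float:
    """One-sided exact sign test: P(X >= successes) for X ~ Binomial(n, 0.5)."""
    # Count the outcomes at least as extreme as the observed one.
    tail = sum(comb(n, k) for k in range(successes, n + 1))
    return tail / 2 ** n


# Counts taken from the finding above: 28 of 29 evaluable categories.
p_value = sign_test_p(28, 29)
print(f"{p_value:.2e}")  # 30 / 2**29, about 5.59e-08
```

Under those assumptions the tail probability is 30 / 2^29, so a split this extreme would be very unlikely if the steering direction had no consistent effect across categories.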

Claims (2)

claim

Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor
supports
Normative-scientific claim about the alignment implications of Experiment 2's findings
Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay
supports
Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.

Deception- and Roleplay-Related SAE Features · concept · 0.893
Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
Deception and Roleplay SAE Features · concept · 0.876
Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in Llama 70B
Experiment 2: SAE Deception Feature Steering · concept · 0.808
Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.800
Claim that feature grounding enables interpretability metrics.
Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor · finding · 0.793
Control result ruling out that observed gating reflects generic RLHF cancellation
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.792
SAEs uncover safety-relevant representations that might be monitored or controlled.
Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment · finding · 0.781
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.780
Surprising finding that the two evaluation methods diverge in their relationship with persistence
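
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched as follows. This is an illustration under stated assumptions, not the graph's actual implementation: only the 0.65 threshold comes from this page, while the embedding vectors, the entity names, the `typed_edges` mapping, and the function `untyped_neighbors` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with target."""
    linked = typed_edges.get(target, set())
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already linked
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((name, sim))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 3-dimensional embeddings (hypothetical); real ones would come from
# whatever embedding model the graph uses.
embeddings = {
    "claim": [1.0, 0.0, 0.0],
    "near_duplicate": [0.9, 0.1, 0.0],
    "linked_finding": [1.0, 0.1, 0.0],
    "unrelated": [0.0, 1.0, 0.0],
}
typed_edges = {"claim": {"linked_finding"}}

candidates = untyped_neighbors("claim", embeddings, typed_edges)
print(candidates)
```

In this toy data, `linked_finding` is excluded because it already has a typed edge and `unrelated` falls below the threshold, leaving `near_duplicate` as the only candidate for review.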