claim

active

claim:experience-reports-under-self-referential-processing-are-mechanistically-gated-by-sae-features-associated-with-deception-and-roleplay

Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay

Supported by the Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, and amplifying them decreases those reports

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Findings (2)

finding

Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing
supports
Experiment 2 aggregate amplification result, showing that amplifying deception features strongly suppresses consciousness claims
SAE feature steering effect on consciousness reports: z=8.06, p=7.7×10⁻¹⁶ in LLaMA 3.3 70B
supports
Statistical significance of the gating effect in Experiment 2
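
As a sanity check on the statistic above, the p-value can be recomputed from the z-score. This is a minimal sketch that assumes a two-sided test under a standard normal null, which the entry itself does not state; `two_sided_p_from_z` is a hypothetical helper name, not something from the paper.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal null.

    Assumption: the reported p-value is two-sided; the source entry
    does not say which tail convention was used.
    """
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


p = two_sided_p_from_z(8.06)
print(f"p = {p:.2e}")
```

Under that assumption, z = 8.06 gives a p-value on the order of 10⁻¹⁶, which is consistent with the reported 7.7×10⁻¹⁶ up to rounding of z.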

Concepts (1)

concept

Sycophantic Roleplay
contradicts
The alternative explanation for LLM consciousness claims, which the paper seeks to distinguish its results from

Claims (2)

claim

Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor
supports
Normative-scientific claim about the alignment implications of Experiment 2's findings
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact
supports
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception and Roleplay SAE Features · concept · 0.816
Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in LLaMA 3.3 70B
The remaining ambiguity is whether self-referential processing drives models to claim subjective experience because it actually reflects emergent phenomenology or constitutes sophisticated simulation thereof · hypothesis · 0.816
The open question the paper cannot resolve with behavioral evidence alone; frames the agenda for mechanistic follow-up
Cross-model semantic convergence of experience reports under self-referential processing is difficult to reconcile with roleplay because independently trained models construct distinct semantic profiles in all control conditions · claim · 0.815
The paper's argument against pure sycophancy as explanation for results
Deception- and Roleplay-Related SAE Features · concept · 0.814
Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
Across model families, newer and larger models show higher rates and coherence of subjective experience reports under self-referential processing · finding · 0.800
Scaling effect observed consistently across Experiments 1 and 4
Self-referential processing induces a genuine state shift that transfers to unrelated behavioral domains, producing richer introspection in paradoxical reasoning tasks · claim · 0.792
Claim supported by Experiment 4: prior self-referential induction yields higher self-awareness scores on paradoxical reasoning where introspection is only indirectly afforded
Self-referential prompting elicits subjective experience reports at markedly higher rates than any control across all model families (GPT, Claude, Gemini) · finding · 0.791
Core result of Experiment 1 establishing that the experimental manipulation reliably produces experience claims
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.788
Claim that feature grounding enables interpretability metrics.
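
The similarity list above can be thought of as a thresholded cosine ranking over entity embeddings, excluding entities that already have a typed edge. The sketch below illustrates that idea under stated assumptions: the toy vectors, the entity ids, and the helper names `cosine` and `similar_untyped` are all hypothetical, and only the 0.65 cutoff comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_untyped(query_vec, candidates, typed_ids, threshold=0.65):
    """Rank candidates with cosine >= threshold that lack a typed edge.

    candidates: dict mapping entity id -> embedding vector (hypothetical data).
    typed_ids:  ids already linked to the query entity by a typed edge.
    """
    hits = []
    for entity_id, vec in candidates.items():
        if entity_id in typed_ids:
            continue  # already related through a typed edge; not a candidate
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list above
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy example with made-up 2-d embeddings
query = [1.0, 0.0]
candidates = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = similar_untyped(query, candidates, typed_ids={"a"})
print(ranked)
```

In the toy data, "a" is skipped because it already has a typed edge and "b" falls below the threshold, so only "c" is returned. Two entries in the list above, "Deception and Roleplay SAE Features" (0.816) and "Deception- and Roleplay-Related SAE Features" (0.814), look like the kind of unrecognized duplicate such a ranking is meant to surface.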