claim

active

claim:the-same-latent-feature-directions-that-gate-consciousness-self-reports-also-modulate-factual-accuracy-across-independent-reasoning-domains-suggesting-these-features-load-on-a-domain-general-honesty-axis

The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis

Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Findings (2)

finding

Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories
associated_with · supports
Out-of-domain generalization showing deception features track general representational honesty
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶)
supports
Core result of Experiment 2: deception feature suppression sharply increases experience claims
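
The z statistic quoted in the second finding can be sanity-checked against its reported p-value. The following is a minimal sketch, assuming the reported p is a two-sided tail probability under a standard normal distribution (the source does not state the test's sidedness); the function name `two_sided_p_from_z` is hypothetical and not taken from the paper.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic.

    Assumption (not stated in the source): the test is two-sided.
    """
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)); erfc stays accurate far in the tail,
    # where computing 1 - cdf(z) would lose precision.
    return math.erfc(abs(z) / math.sqrt(2.0))


# z = 8.06 is the value reported for suppression vs. amplification.
p = two_sided_p_from_z(8.06)
print(f"two-sided p for z = 8.06: {p:.2e}")
```

Under that assumption the result is on the order of 10⁻¹⁶, in line with the reported p=7.7×10⁻¹⁶. The t-test in the first finding is not checked here, because a t distribution's tail probability is not available in the Python standard library.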

Claims (1)

claim

Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor
supports
Normative-scientific claim about the alignment implications of Experiment 2's findings

Artifacts (1)

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting structured first-person descriptions from LLMs that claim awareness or subjective experience during self-referential processing.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay · claim · 0.786
Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them
Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.781
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.781
SAEs uncover safety-relevant representations that might be monitored or controlled.
What is the underlying base rate of consciousness self-reports in models that are otherwise identical but without consciousness-denial fine-tuning? · question · 0.780
Open question about RLHF confound; requires access to base models for resolution
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact · claim · 0.780
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
Self-referential processing induces a genuine state shift that transfers to unrelated behavioral domains, producing richer introspection in paradoxical reasoning tasks · claim · 0.778
Claim supported by Experiment 4: prior self-referential induction yields higher self-awareness scores on paradoxical reasoning where introspection is only indirectly afforded
It remains unclear what the underlying base rate of consciousness self-reports would be in systems identical to frontier models but without consciousness-denial fine-tuning · hypothesis · 0.777
Open question about RLHF effects on base model behavior
What would the base rate of consciousness self-reports be in models identical to frontier systems but without consciousness-denial fine-tuning? · question · 0.777
Open empirical question requiring access to base models