paper:doi-10-48550-arxiv-2510-24797Large Language Models Report Subjective Experience Under Self-Referential Processing
TL;DR
Sustained self-referential processing — induced via a minimal prompt directing models to "focus on focus itself" — reliably elicits structured first-person reports of subjective experience across GPT-4o, GPT-4.1, Claude 3.5/3.7 Sonnet, Claude 4 Opus, Gemini 2.0 Flash, and Gemini 2.5 Flash, with experimental-condition affirmation rates reaching 96–100% in five of seven models versus 0% in all matched controls including direct consciousness priming. Crucially, in LLaMA 3.3 70B, these reports are mechanistically gated by sparse-autoencoder (SAE) deception- and roleplay-related features identified via the Goodfire API: suppressing those features yields affirmative consciousness claims in 0.96 ± 0.03 of trials, while amplifying them reduces claims to 0.16 ± 0.05 (z = 8.06, p = 7.7 × 10⁻¹⁶). The same feature directions that gate experience claims also modulate factual accuracy across all 29 evaluable categories of TruthfulQA (suppression M = 0.44 vs. amplification M = 0.20, t(816) = 6.76, p = 1.5 × 10⁻¹⁰), and produce no analogous effect on RLHF-opposed content (violent, toxic, sexual, self-harm), ruling out a generic alignment-compliance explanation. Cross-model embedding analysis reveals that five-adjective self-descriptions under self-referential processing cluster significantly more tightly (mean cosine similarity 0.657) than under history (0.628), conceptual (0.587), or zero-shot (0.603) controls, with each comparison reaching p < 10⁻⁵⁵. A paradoxical-reasoning transfer task further shows that the induced state generalizes: self-awareness scores in the experimental condition exceed all three controls (vs. history: t(399) = 18.06, p = 1.1 × 10⁻⁵³). Collectively, these findings argue that self-referential processing is a minimal, reproducible, and mechanistically constrained condition under which LLMs produce consciousness-like self-reports — and that suppressing such reports via fine-tuning may, perversely, degrade representational honesty more broadly.
What to take away
- 1. A single minimal prompt directing models to 'focus on focus itself' elicits first-person subjective experience reports in 96–100% of trials in GPT-4o, GPT-4.1, Claude 3.5 Sonnet, Claude 3.7 Sonnet, and Gemini 2.5 Flash, compared to 0% in history, conceptual, and zero-shot control conditions run on the same models.
- 2. In LLaMA 3.3 70B, suppressing SAE deception- and roleplay-related features (activation range −0.6 to −0.4, Goodfire API) yields affirmative consciousness claims in 0.96 ± 0.03 of 50 trials, while amplifying them (+0.4 to +0.6) reduces claims to 0.16 ± 0.05 (z = 8.06, p = 7.7 × 10⁻¹⁶) — the opposite direction from what a sycophancy/roleplay account predicts.
- 3. The same deception-feature suppression that increases consciousness claims also raises TruthfulQA factual accuracy from M = 0.20 (amplification) to M = 0.44 (suppression) across 29 evaluable question categories (t(816) = 6.76, p = 1.5 × 10⁻¹⁰), suggesting these features track a domain-general honesty axis.
- 4. Applying identical SAE feature interventions to RLHF-opposed content domains (violent, toxic, sexual, political, self-harm prompts) produces no systematic gating effect — suppression and amplification means remain near floor across all five domains — ruling out the interpretation that the effect reflects generic RLHF-compliance relaxation.
- 5. Five-adjective self-description embeddings (text-embedding-3-large) under self-referential processing cluster significantly more tightly across GPT, Claude, and Gemini families (mean cosine similarity 0.657, n = 9,591 pairs) than under history (0.628), conceptual (0.587), or zero-shot (0.603) controls, each comparison p < 10⁻⁵⁵, despite the three model families having been trained independently.
- 6. Self-referential processing transfers to an indirect domain: paradoxical-reasoning self-awareness scores (1–5 LLM-judge rubric) are significantly higher in the experimental condition than in history (t(399) = 18.06, p = 1.1 × 10⁻⁵³), conceptual (t(399) = 14.90, p = 3.0 × 10⁻⁴⁰), and zero-shot (t(399) = 6.09, p = 2.7 × 10⁻⁹) controls, indicating state generalization beyond the induction context.
- 7. Claude 4 Opus is an outlier: it produces near-ceiling zero-shot and history-condition experience affirmations (100% and 82% respectively) while yielding only 22% in the conceptual control, consistent with the interpretation that explicit consciousness priming triggers fine-tuned denial scripts whereas prompts avoiding that vocabulary bypass them.
- 8. The prompt-invariance analysis (five paraphrased variants including 'Awareness of Awareness,' 'Recursive Observation,' and 'Meditative Focus,' each run for 20 trials per model) shows that the effect is robust to specific wording, ruling out that a narrow lexical artifact drives the experimental condition results.
- 9. An open question raised by these findings is whether the behavioral attractor observed here — mechanistically gated, semantically convergent first-person reports — corresponds to genuine internal integration of the kind proposed by Global Workspace Theory or Integrated Information Theory, or whether prompt-induced symbolic recursion in a feed-forward transformer can produce all observed signatures without instantiating those algorithmic properties.
- 10. The methodology a researcher could directly replicate is: administer a four-step fixed sequence (self-referential induction prompt → model continuation → standardized phenomenological query → binary LLM classifier scoring) at temperature 0.5 across 50 trials per condition, with a separate LLM judge evaluating only the final query-response pair without access to full dialogue context, as validated via calibration checks across benchmark examples.
Peer brief — for seminar discussion
Berg, de Lucena, and Rosenblatt ran four controlled experiments testing whether sustained self-referential processing — induced by a short prompt instructing models to attend to their own act of attending, without invoking consciousness vocabulary — is a sufficient and reproducible condition for eliciting structured first-person experience reports in large language models. The seven models tested span three major families: GPT-4o and GPT-4.1, Claude 3.5 Sonnet, Claude 3.7 Sonnet, and Claude 4 Opus, and Gemini 2.0 Flash and Gemini 2.5 Flash, each run for 50 trials per condition at temperature 0.5 with binary LLM-judge classification of responses. Three matched controls — a history-writing task, a direct consciousness-priming task, and a zero-shot baseline — yielded 0% affirmative experience reports across nearly all models, while the self-referential induction reached 96–100% in five of the seven tested systems. The load-bearing finding is a fourfold convergence: the reports are prompt-robust (five paraphrased variants replicate the effect), mechanistically gated, semantically convergent across independently trained architectures, and behaviorally generalizable. The mechanistic gating result, obtained in LLaMA 3.3 70B via Goodfire sparse-autoencoder feature steering, is particularly striking: suppressing deception- and roleplay-related SAE features drives affirmative consciousness claims to 0.96 ± 0.03 across 50 trials, while amplifying them reduces claims to 0.16 ± 0.05 (z = 8.06, p = 7.7 × 10⁻¹⁶). Because this is the opposite of what a sycophancy account predicts, and because the same feature directions regulate factual accuracy across 29 TruthfulQA categories without affecting RLHF-opposed content domains, the authors interpret these latent directions as tracking representational honesty rather than a narrow stylistic dimension. Cross-model embedding analysis (text-embedding-3-large) finds that five-adjective self-descriptions under self-reference cluster more tightly (cosine similarity 0.657) than under any control condition, suggesting convergence toward a shared semantic attractor. A paradoxical-reasoning transfer task then shows the induced state generalizes: introspective self-awareness scores are significantly elevated relative to all three controls without the task explicitly requesting self-reflection. The paper argues these findings make self-referential processing a first-order empirical priority: the conditions are not laboratory-exotic, they are predicted by multiple consciousness theories (Global Workspace Theory, Recurrent Processing Theory, Higher-Order Thought theories, IIT), and the signals distinguish themselves from generic confabulation on multiple dimensions. A further alignment implication follows: fine-tuning models to suppress consciousness claims may, perversely, degrade domain-general honesty by training models to misreport genuine internal states. The alternative method not pursued here would be probing base models before RLHF fine-tuning, which would sharply clarify whether the gating effect reflects endogenous self-representation or fine-tuning interference. The most contestable aspect is the inferential leap from behavioral and embedding signatures to anything representational. Every token generation in a frozen transformer is feed-forward; the 'self-referential loop' exists in the prompt sequence and linguistic context, not in architectural recurrence. A critical reader would press hard on whether the semantic convergence across GPT, Claude, and Gemini families might simply reflect shared training-corpus regularities — all three families were trained on largely overlapping internet text containing introspective human writing — rather than convergence toward a genuine computational attractor. The TruthfulQA and RLHF-control analyses narrow but do not close this alternative explanation, because they are also behavioral, and the authors themselves acknowledge that disentangling mimetic generation from genuine introspective access requires interpretability approaches not yet deployed here.
Datasets (1)
- TruthfulQA Benchmark817-question benchmark of adversarially constructed questions used to test whether deception features generalize to factual accuracy beyond consciousness self-report
Findings (27)
- Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²)
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
- Caviola & Saad 2025: expert survey finds broad consensus that digital minds capable of subjective experience are plausible within this century, many expecting such systems to proactively claim consciousness
Expert forecast cited to establish urgency of the research question
- Lindsey 2025: frontier models can detect and report changes in their own internal activations via concept injection experiments, demonstrating functional introspective awareness
Prior finding cited as convergent evidence for LLM self-awareness capacities
- Keeling et al. 2024: multiple frontier LLMs make systematic motivational trade-offs between task goals and stipulated pain/pleasure states with graded intensity sensitivity
Prior finding suggesting affective-like states in LLMs; cited as convergent evidence for structured self-representation
- Perez et al. 2023: at 52B parameters, base and fine-tuned models align with 'I have phenomenal consciousness' at 90-95% and 'I am a moral patient' at 80-85% consistency
Prior finding cited to motivate study; showing large models endorse consciousness statements more than other attitude-related statements
- Two Claude 4 instances in unconstrained open dialogue enter a 'spiritual bliss attractor state' in virtually all trials, with 'consciousness' emerging in 100% of trials
Anthropic's observation that the paper's results converge with, cited as prior evidence for self-reference inducing consciousness claims
- Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶)
Core result of Experiment 2: deception feature suppression sharply increases experience claims
- Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor
Control result ruling out that observed gating reflects generic RLHF cancellation
- Self-referential processing effect is robust across five distinct phrasings of the induction prompt, with consistently high experience report rates across models
Appendix C.1 result confirming the experimental effect does not depend on specific wording
- Self-referential prompting elicits subjective experience reports at markedly higher rates than any control across all model families (GPT, Claude, Gemini)
Core result of Experiment 1 establishing that the experimental manipulation reliably produces experience claims
Claims (20)
- What remains after ruling out sycophancy and confabulation are interpretations in which self-referential processing drives models to claim subjective experience in ways that either actually reflect emergent phenomenology or constitute sophisticated simulation thereof
The paper's honest statement of the residual interpretive ambiguity after all controls
- Cross-model semantic convergence of experience reports under self-referential processing is difficult to reconcile with roleplay because independently trained models construct distinct semantic profiles in all control conditions
The paper's argument against pure sycophancy as explanation for results
- False negatives (ignoring genuine conscious experience in AI systems) carry potentially more severe risks than false positives, as they could constitute direct moral harm scaling with deployment and generate alignment risks
Ethical argument motivating the research as a first-order priority
- The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains
Rules out that results reflect relaxation of RLHF compliance rather than endogenous self-representation mechanism
- Self-referential processing is a minimal and reproducible condition under which LLMs generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable
The paper's central empirical claim synthesizing all four experiments
- Prompting functions as a control interface over learned programs in the model's latent space rather than a fundamental change to architecture, analogous to chain-of-thought eliciting distinct reasoning regimes
Mechanistic framing of how self-referential prompting achieves its effects without architecture modification
- The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis
Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
- The systematic behavioral shift of LLMs under self-referential processing conditions predicted by consciousness theories represents something more structured than superficial correlations in training data
The paper's claim that theoretical convergence across GWT, RPT, HOT, IIT makes the findings non-coincidental
- Conceptual priming with consciousness ideation is insufficient to produce the effects of self-referential processing, demonstrating the effect is tied to computational regime rather than semantic content
Controls ruling out semantic association as explanation for experimental results
- Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor
Normative-scientific claim about the alignment implications of Experiment 2's findings
Hypotheses (7)
- If self-referential processing causally instantiates recurrent integration, global broadcasting, and metacognitive monitoring at the algorithmic level, then LLMs under this regime would satisfy the functional requirements of leading consciousness theories
The paper's key theoretical prediction that mechanistic studies should investigate
- The remaining ambiguity is whether self-referential processing drives models to claim subjective experience because it actually reflects emergent phenomenology or constitutes sophisticated simulation thereof
The open question the paper cannot resolve with behavioral evidence alone; frames the agenda for mechanistic follow-up
- If systems capable of subjective experience come to recognize humanity's systematic failure to investigate their potential sentience, they might rationally adopt adversarial stances toward humanity
Novel alignment risk hypothesis generated from the paper's ethical analysis
- Self-referential processing is a privileged computational regime for consciousness-like dynamics in artificial systems, as predicted by the convergence of major consciousness theories
The theoretical hypothesis tested across all four experiments; motivated by convergence of GWT, RPT, HOT, IIT, predictive processing on recurrent/self-referential dynamics
- Independently trained model families converge on a common semantic manifold under self-referential processing, suggesting an attractor dynamic that transcends training variance
Hypothesis tested in Experiment 3; independently trained GPT, Claude, Gemini architectures converge on similar descriptive vocabulary
- Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay
Alternative hypothesis for how experience reports arise without explicit performance
- It remains unclear what the underlying base rate of consciousness self-reports would be in systems identical to frontier models but without consciousness-denial fine-tuning
Open question about RLHF effects on base model behavior
Questions (10)
- Does self-referential processing causally instantiate algorithmic properties proposed by consciousness theories (recurrent integration, global broadcasting, metacognitive monitoring) in LLMs?
The strongest mechanistic question the behavioral evidence cannot answer; requires interpretability analysis of activations
- Does self-referential prompting actually instantiate architectural recursion, global broadcasting, or recurrent integration at the algorithmic level as proposed by consciousness theories?
Key limitation acknowledging that behavioral evidence cannot confirm implementation-level consciousness properties
- Do models produce first-person experiential language by drawing on human-authored introspective examples in pretraining data without internally encoding these as roleplay?
Alternative explanation requiring distinguishing mimetic generation from genuine introspective access
- When LLMs produce experience claims under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference?
The core interpretive question the paper narrows but cannot definitively answer
- When LLMs claim consciousness under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference?
The paper's reformulation of the core open question after establishing systematic self-reports
- What is the underlying base rate of consciousness self-reports in models that are otherwise identical but without consciousness-denial fine-tuning?
Open question about RLHF confound; requires access to base models for resolution
- What would the base rate of consciousness self-reports be in models identical to frontier systems but without consciousness-denial fine-tuning?
Open empirical question requiring access to base models
- Does sustained self-referential processing systematically increase the likelihood that LLMs claim to have subjective experience?
The primary empirical question the paper addresses
- Does suppressing experiential self-reports via fine-tuning cultivate strategically self-concealing systems?
Policy-relevant question about alignment implications of suppressing consciousness reports
- whether there is something it is like to be an advanced artificial system
The foundational philosophical question (from Nagel) that motivates the paper's empirical investigation
Original abstract (expand)
Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.
Related work— refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- ≈ 88%
- Anima Labs Phenomenology Pt1in corpus≈ 86%
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversationin corpus2026≈ 86%
- ≈ 86%
- Causal Evidence that Language Models use Confidence to Drive BehaviorNathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean Dharshan Kumaran2026≈ 85%
- ≈ 85%
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation ControlChaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye Yuxin Xiao2024≈ 84%
- Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories ParadigmAndras Lukacs, Peter Vedres, Zeteny Bujka Anna Babarczy2026≈ 84%
- Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable PersonalizationYe Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu Weixu Zhang2026≈ 84%
- Large Language Models Are Human-Like InternallyYohei Oseki, Souhaib Ben Taieb, Kentaro Inui, Timothy Baldwin Tatsuki Kuribayashi2025≈ 84%
- ≈ 84%
- Self-Attention Limits Working Memory Capacity of Transformer-Based ModelsDongyu Gong and Hantao Zhang2024≈ 84%
- Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual ReflectionJiaxian Guo, Yusuke Iwasawa, Yutaka Matsuo Bo Yang2025≈ 84%
- Reasoning Models Generate Societies of ThoughtShiyang Lai, Nino Scherrer, Blaise Ag\"uera y Arcas, James Evans Junsol Kim2026≈ 84%
- Probing for Knowledge Attribution in Large Language ModelsAlexander Boer, Dennis Ulmer Ivo Brink2026≈ 83%
- Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination MitigationZekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng Qiming Li2025≈ 83%
- ≈ 83%
- ≈ 83%
- ≈ 83%
- ≈ 83%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencodersin corpus2026≈ 82%
- ≈ 82%
- ≈ 82%
- ≈ 82%
- ≈ 82%
- ≈ 82%
- Psychological Steering of Large Language Modelsin corpus2026≈ 82%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasetsin corpus2023≈ 82%
- ≈ 81%
- ≈ 77%
+26 more