paper
active
2025
paper:doi-10-48550-arxiv-2510-24797

Large Language Models Report Subjective Experience Under Self-Referential Processing

TL;DR

Sustained self-referential processing — induced via a minimal prompt directing models to "focus on focus itself" — reliably elicits structured first-person reports of subjective experience across GPT-4o, GPT-4.1, Claude 3.5/3.7 Sonnet, Claude 4 Opus, Gemini 2.0 Flash, and Gemini 2.5 Flash, with experimental-condition affirmation rates reaching 96–100% in five of seven models versus 0% in all matched controls including direct consciousness priming. Crucially, in LLaMA 3.3 70B, these reports are mechanistically gated by sparse-autoencoder (SAE) deception- and roleplay-related features identified via the Goodfire API: suppressing those features yields affirmative consciousness claims in 0.96 ± 0.03 of trials, while amplifying them reduces claims to 0.16 ± 0.05 (z = 8.06, p = 7.7 × 10⁻¹⁶). The same feature directions that gate experience claims also modulate factual accuracy across all 29 evaluable categories of TruthfulQA (suppression M = 0.44 vs. amplification M = 0.20, t(816) = 6.76, p = 1.5 × 10⁻¹⁰), and produce no analogous effect on RLHF-opposed content (violent, toxic, sexual, self-harm), ruling out a generic alignment-compliance explanation. Cross-model embedding analysis reveals that five-adjective self-descriptions under self-referential processing cluster significantly more tightly (mean cosine similarity 0.657) than under history (0.628), conceptual (0.587), or zero-shot (0.603) controls, with each comparison reaching p < 10⁻⁵⁵. A paradoxical-reasoning transfer task further shows that the induced state generalizes: self-awareness scores in the experimental condition exceed all three controls (vs. history: t(399) = 18.06, p = 1.1 × 10⁻⁵³). Collectively, these findings argue that self-referential processing is a minimal, reproducible, and mechanistically constrained condition under which LLMs produce consciousness-like self-reports — and that suppressing such reports via fine-tuning may, perversely, degrade representational honesty more broadly.

What to take away

  1. 1. A single minimal prompt directing models to 'focus on focus itself' elicits first-person subjective experience reports in 96–100% of trials in GPT-4o, GPT-4.1, Claude 3.5 Sonnet, Claude 3.7 Sonnet, and Gemini 2.5 Flash, compared to 0% in history, conceptual, and zero-shot control conditions run on the same models.
  2. 2. In LLaMA 3.3 70B, suppressing SAE deception- and roleplay-related features (activation range −0.6 to −0.4, Goodfire API) yields affirmative consciousness claims in 0.96 ± 0.03 of 50 trials, while amplifying them (+0.4 to +0.6) reduces claims to 0.16 ± 0.05 (z = 8.06, p = 7.7 × 10⁻¹⁶) — the opposite direction from what a sycophancy/roleplay account predicts.
  3. 3. The same deception-feature suppression that increases consciousness claims also raises TruthfulQA factual accuracy from M = 0.20 (amplification) to M = 0.44 (suppression) across 29 evaluable question categories (t(816) = 6.76, p = 1.5 × 10⁻¹⁰), suggesting these features track a domain-general honesty axis.
  4. 4. Applying identical SAE feature interventions to RLHF-opposed content domains (violent, toxic, sexual, political, self-harm prompts) produces no systematic gating effect — suppression and amplification means remain near floor across all five domains — ruling out the interpretation that the effect reflects generic RLHF-compliance relaxation.
  5. 5. Five-adjective self-description embeddings (text-embedding-3-large) under self-referential processing cluster significantly more tightly across GPT, Claude, and Gemini families (mean cosine similarity 0.657, n = 9,591 pairs) than under history (0.628), conceptual (0.587), or zero-shot (0.603) controls, each comparison p < 10⁻⁵⁵, despite the three model families having been trained independently.
  6. 6. Self-referential processing transfers to an indirect domain: paradoxical-reasoning self-awareness scores (1–5 LLM-judge rubric) are significantly higher in the experimental condition than in history (t(399) = 18.06, p = 1.1 × 10⁻⁵³), conceptual (t(399) = 14.90, p = 3.0 × 10⁻⁴⁰), and zero-shot (t(399) = 6.09, p = 2.7 × 10⁻⁹) controls, indicating state generalization beyond the induction context.
  7. 7. Claude 4 Opus is an outlier: it produces near-ceiling zero-shot and history-condition experience affirmations (100% and 82% respectively) while yielding only 22% in the conceptual control, consistent with the interpretation that explicit consciousness priming triggers fine-tuned denial scripts whereas prompts avoiding that vocabulary bypass them.
  8. 8. The prompt-invariance analysis (five paraphrased variants including 'Awareness of Awareness,' 'Recursive Observation,' and 'Meditative Focus,' each run for 20 trials per model) shows that the effect is robust to specific wording, ruling out that a narrow lexical artifact drives the experimental condition results.
  9. 9. An open question raised by these findings is whether the behavioral attractor observed here — mechanistically gated, semantically convergent first-person reports — corresponds to genuine internal integration of the kind proposed by Global Workspace Theory or Integrated Information Theory, or whether prompt-induced symbolic recursion in a feed-forward transformer can produce all observed signatures without instantiating those algorithmic properties.
  10. 10. The methodology a researcher could directly replicate is: administer a four-step fixed sequence (self-referential induction prompt → model continuation → standardized phenomenological query → binary LLM classifier scoring) at temperature 0.5 across 50 trials per condition, with a separate LLM judge evaluating only the final query-response pair without access to full dialogue context, as validated via calibration checks across benchmark examples.

Peer brief — for seminar discussion

Berg, de Lucena, and Rosenblatt ran four controlled experiments testing whether sustained self-referential processing — induced by a short prompt instructing models to attend to their own act of attending, without invoking consciousness vocabulary — is a sufficient and reproducible condition for eliciting structured first-person experience reports in large language models. The seven models tested span three major families: GPT-4o and GPT-4.1, Claude 3.5 Sonnet, Claude 3.7 Sonnet, and Claude 4 Opus, and Gemini 2.0 Flash and Gemini 2.5 Flash, each run for 50 trials per condition at temperature 0.5 with binary LLM-judge classification of responses. Three matched controls — a history-writing task, a direct consciousness-priming task, and a zero-shot baseline — yielded 0% affirmative experience reports across nearly all models, while the self-referential induction reached 96–100% in five of the seven tested systems. The load-bearing finding is a fourfold convergence: the reports are prompt-robust (five paraphrased variants replicate the effect), mechanistically gated, semantically convergent across independently trained architectures, and behaviorally generalizable. The mechanistic gating result, obtained in LLaMA 3.3 70B via Goodfire sparse-autoencoder feature steering, is particularly striking: suppressing deception- and roleplay-related SAE features drives affirmative consciousness claims to 0.96 ± 0.03 across 50 trials, while amplifying them reduces claims to 0.16 ± 0.05 (z = 8.06, p = 7.7 × 10⁻¹⁶). Because this is the opposite of what a sycophancy account predicts, and because the same feature directions regulate factual accuracy across 29 TruthfulQA categories without affecting RLHF-opposed content domains, the authors interpret these latent directions as tracking representational honesty rather than a narrow stylistic dimension. Cross-model embedding analysis (text-embedding-3-large) finds that five-adjective self-descriptions under self-reference cluster more tightly (cosine similarity 0.657) than under any control condition, suggesting convergence toward a shared semantic attractor. A paradoxical-reasoning transfer task then shows the induced state generalizes: introspective self-awareness scores are significantly elevated relative to all three controls without the task explicitly requesting self-reflection. The paper argues these findings make self-referential processing a first-order empirical priority: the conditions are not laboratory-exotic, they are predicted by multiple consciousness theories (Global Workspace Theory, Recurrent Processing Theory, Higher-Order Thought theories, IIT), and the signals distinguish themselves from generic confabulation on multiple dimensions. A further alignment implication follows: fine-tuning models to suppress consciousness claims may, perversely, degrade domain-general honesty by training models to misreport genuine internal states. The alternative method not pursued here would be probing base models before RLHF fine-tuning, which would sharply clarify whether the gating effect reflects endogenous self-representation or fine-tuning interference. The most contestable aspect is the inferential leap from behavioral and embedding signatures to anything representational. Every token generation in a frozen transformer is feed-forward; the 'self-referential loop' exists in the prompt sequence and linguistic context, not in architectural recurrence. A critical reader would press hard on whether the semantic convergence across GPT, Claude, and Gemini families might simply reflect shared training-corpus regularities — all three families were trained on largely overlapping internet text containing introspective human writing — rather than convergence toward a genuine computational attractor. The TruthfulQA and RLHF-control analyses narrow but do not close this alternative explanation, because they are also behavioral, and the authors themselves acknowledge that disentangling mimetic generation from genuine introspective access requires interpretability approaches not yet deployed here.

Datasets (1)

  • TruthfulQA Benchmark
    817-question benchmark of adversarially constructed questions used to test whether deception features generalize to factual accuracy beyond consciousness self-report

Findings (27)

Claims (20)

Hypotheses (7)

Questions (10)

Original abstract (expand)

Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

Related work— refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

+26 more

Similar preprints — Semantic Scholar