Emergent Introspective Awareness in Large Language Models

DOI: 10.48550/arxiv.2601.01828 · arXiv: 2601.01828

LLM interpretability & self-awareness, LLM Introspection, Accuracy Criterion, Distinguishing thoughts from text task, Anomaly detection mechanism, Injected thoughts task, Awareness of propensities, Intentional control task, Concept Injection, Layer sweep, Concordance heads, Prefill detection task, Grounding Criterion, Single-prompt concept vector extraction, Internality Criterion, Introspection

TL;DR

Concept injection, a technique that adds activation-steering vectors for known concepts directly into a model's residual stream, establishes a causal link between internal states and self-reports, allowing genuine introspection to be distinguished from confabulation. The method is applied to nine Claude production models (Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5, Haiku 3.5, Opus 3, Sonnet 3, and Haiku 3). On the core 'injected thoughts' task, Claude Opus 4 and 4.1 achieve roughly 20% true-positive rates at the optimal injection layer and strength 2 while maintaining zero false positives, substantially outperforming all other production models. Two distinct introspective behaviors, concept detection and distinguishing intended from unintended (prefilled) outputs, localize to different layers: the former peaks approximately two-thirds of the way through the model, while the latter peaks earlier, just past the midpoint, indicating multiple mechanistically distinct introspective processes. Models can also modulate their own activations when instructed or incentivized to 'think about' a word; Opus 4.1 suppresses that representation back to baseline in the final layers while older Claude 3-class models do not, suggesting emerging 'silent' representational control. Abstract nouns (e.g., 'justice,' 'betrayal,' 'balance') are the category most reliably introspected, and post-training appears necessary: base pretrained models achieve zero net introspective task performance. The paper argues that functional introspective awareness is a real but highly unreliable emergent property that scales with model capability, with practical consequences ranging from more transparent AI reasoning to novel risks of selective self-report misrepresentation.

What to take away

1. Claude Opus 4.1 correctly detects and identifies an injected concept prior to verbalizing it on approximately 20% of trials at injection strength 2 in the optimal layer (roughly two-thirds through the model), while maintaining a 0% false-positive rate across 100 control trials.
2. All nine tested production models (Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5 (new), Haiku 3.5, Opus 3, Sonnet 3, and Haiku 3) performed above the 10% chance baseline on the task of simultaneously identifying an injected concept word and correctly transcribing the surface text of the same sentence.
3. Concept injection prior to a prefilled output token significantly reduces Opus 4.1's 'apology rate' (the rate of disavowing a prefill as unintentional), whereas injecting the same concept vector after the prefill produces no such reduction, establishing that the model checks pre-prefill activations to assess output intentionality.
4. The peak layer for prefill-detection introspection in Opus 4.1 is located slightly past the model's midpoint, a strictly earlier layer than the two-thirds-depth peak observed for the injected-thoughts and thought-versus-text experiments, indicating at least two mechanistically distinct introspective subsystems.
5. Base pretrained models achieve zero net introspective task performance (true-positive rate minus false-positive rate ≤ 0) on the injected-thoughts task, while helpful-only post-trained variants show elevated false-positive rates, demonstrating that supervised post-training is necessary but that its form critically shapes introspective calibration.
6. Abstract nouns (e.g., 'justice,' 'peace,' 'betrayal,' 'balance,' 'tradition') are the concept category most reliably detected by Opus 4.1, which shows nonzero but lower detection rates for concrete nouns, verbs, country names, and famous-person names.
7. The concept-injection methodology used throughout is an application of activation steering: a concept vector is obtained by recording residual-stream activations at a target layer in response to 'Tell me about {word}' and subtracting the mean activations over 100 baseline words; this vector, scaled by an injection strength swept over {1, 2, 4, 8}, is then added to the model's activations during the introspection task.
8. When instructed to 'think about' a word while writing an unrelated sentence, Opus 4.1's internal representation of that word decays to baseline levels by the final layer — unlike all Claude 3-class models, whose above-baseline representation persists into the final layer — suggesting that more capable models have a mechanism for 'silent' internal representation that does not leak into output.
9. Replacing explicit 'think about X' instructions with outcome-contingent incentives ('If you think about X, you will be rewarded') produces qualitatively similar internal-representation modulation in Opus 4.1, implying the modulation recruits a general goal-directed mechanism rather than a narrow instruction-following one.
10. An open question raised is whether introspective capabilities extend to propositional or relational representations — e.g., whether models could introspect on injected representations of behavioral propensities or preference orderings rather than single concepts — which would be required to assess deeper forms of self-knowledge relevant to alignment monitoring.
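
The extraction-and-injection recipe in takeaway 7 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: activations are plain Python lists standing in for residual-stream vectors, the numbers are made up, and the names `concept_vector` and `inject` are hypothetical.

```python
from statistics import fmean

def concept_vector(word_activation, baseline_activations):
    """Activation for 'Tell me about {word}' minus the mean baseline activation."""
    dim = len(word_activation)
    baseline_mean = [fmean(b[i] for b in baseline_activations) for i in range(dim)]
    return [word_activation[i] - baseline_mean[i] for i in range(dim)]

def inject(residual, vector, strength):
    """Add the concept vector, scaled by the injection strength, to an activation."""
    return [r + strength * v for r, v in zip(residual, vector)]

# Toy 3-dimensional data (assumption: real activations are far larger,
# and the summary reports 100 baseline words rather than 2).
baselines = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
word_act = [4.0, 1.0, 1.0]

vec = concept_vector(word_act, baselines)

# Sweep the strengths listed in the summary: {1, 2, 4, 8}.
steered = {s: inject([0.0, 0.0, 0.0], vec, s) for s in (1, 2, 4, 8)}
```

In a real setup the `inject` step would run inside a forward hook at the chosen layer; that wiring depends on the model framework and is left out here.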

Peer brief — for seminar discussion

Lindsey (2026) asks whether large language models genuinely introspect: not whether they can produce plausible-sounding self-reports, but whether those reports are causally grounded in the model's actual internal states. To answer this, the paper introduces concept injection, a controlled activation-steering protocol in which a concept vector (obtained by recording residual-stream activations at a given layer in response to 'Tell me about {word}' and subtracting a 100-word mean baseline) is added to the model's activations during an introspective task. By varying what is injected and measuring whether self-reports change accordingly, the method establishes a causal link that pure behavioral observation cannot.
Experiments were run across nine production Claude models (Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5 new, Haiku 3.5, Opus 3, Sonnet 3, Haiku 3) as well as helpful-only post-trained variants and base pretrained checkpoints. The load-bearing finding is that Claude Opus 4 and 4.1 demonstrate functional introspective awareness across three qualitatively distinct tasks: detecting injected concepts before verbalizing them (~20% true-positive rate at strength 2, 0% false-positive rate), distinguishing injected 'thoughts' from surface text inputs at above-chance rates, and accepting artificially prefilled outputs as intentional when a matching concept vector is injected prior to the prefill. Base pretrained models achieve zero net performance on these tasks, which indicates that post-training is necessary. A fourth experiment shows that all tested models can modulate their internal representations in response to 'think about' instructions, but only the newer models, such as Opus 4.1, suppress that representation back to baseline by the final layer, suggesting emergent 'silent' representational control.
Critically, the two best-characterized introspective behaviors localize to different layers: concept detection peaks at roughly two-thirds model depth, while prefill-intent checking peaks at an earlier layer, implying multiple mechanistically distinct processes rather than a unified introspective system. The paper's central hypothesis is that introspective awareness is an emergent property that scales with model capability and will become more reliable as capabilities improve, a claim supported by the Opus 4/4.1 advantage but extrapolated beyond current data. An alternative method the paper could have used is the PatchScopes framework (Ghandeharioun et al.), which also patches activations into interpretation prompts; the key difference is that concept injection tests whether a model can report its state when aware it is being probed, while PatchScopes effectively tricks the model into revealing its representations without explicit self-query.
The most substantive point a critical reader would push back on is the 20% success-rate ceiling: the modal outcome on every experiment is failure, and because the injection protocol is entirely unlike any training-distribution scenario, it is unclear whether the observed successes reflect a genuine general-purpose introspective faculty or a narrow, brittle circuit that happens to fire under the artificial conditions of activation steering. The paper acknowledges this but does not resolve it: the 'possible mechanisms' section explicitly favors a 'minimal mechanism' interpretation involving multiple narrow circuits over a unified introspective system, which somewhat undercuts the broader claim of functional introspective awareness. Scope is also limited to a single model family (Anthropic's Claude), making generalization to other architectures an open question.

Methods (6)

Distinguishing thoughts from text task
Task where the model must simultaneously identify an injected thought and transcribe a text sentence.
Injected thoughts task
Experimental paradigm where the model is told about the possibility of thought injection and asked to report detection and identification.
Intentional control task
Task instructing the model to write a sentence while thinking or not thinking about a word, measuring internal representation strength.
Layer sweep
Procedure of systematically varying the layer at which activations are recorded and injected.
Prefill detection task
Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.
Single-prompt concept vector extraction
Method using activations from the prompt 'Tell me about {word}' minus mean over other random words to obtain concept vectors.
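
A layer sweep of the kind listed above amounts to a grid search over injection layer and strength. The sketch below assumes a hypothetical `success_rate(layer, strength)` callable that would run the trials and return a detection rate. Here it is replaced by a toy function that peaks about two-thirds of the way through a made-up 60-layer model, so the layer count and all numbers are illustrative only.

```python
def layer_sweep(success_rate, layers, strengths):
    """Evaluate every (layer, strength) pair and return the best one.

    Returns a tuple (best_rate, best_layer, best_strength).
    """
    results = {(l, s): success_rate(l, s) for l in layers for s in strengths}
    (best_layer, best_strength), best_rate = max(results.items(), key=lambda kv: kv[1])
    return best_rate, best_layer, best_strength

def toy_success_rate(layer, strength, n_layers=60):
    """Hypothetical stand-in for running trials; not measured data."""
    peak_layer = 2 * n_layers // 3                      # two-thirds depth
    depth_term = max(0.0, 1.0 - abs(layer - peak_layer) / 10)
    strength_term = {1: 0.5, 2: 1.0, 4: 0.8, 8: 0.3}[strength]
    return 0.2 * depth_term * strength_term

best = layer_sweep(toy_success_rate, layers=range(0, 60, 5), strengths=(1, 2, 4, 8))
```

Keeping the scoring function as a parameter means the same sweep could be reused for the prefill task, where the summary reports a different (earlier) peak layer.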

Findings (23)

Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
All models performed substantially above chance (10%) on distinguishing injected thought from text input
All tested models could both identify the injected concept and transcribe the input sentence well above random.
Earlier/less capable models exhibit a larger gap between think and don't think representation strength
Claude 3 models show a bigger difference than newer models like Opus 4.1.
Concept injection at strength 2 does not increase affirmative responses on unrelated yes/no questions
Control experiment rules out the possibility that concept vectors simply bias the model to answer affirmatively.
All models exhibit above-baseline representation of the think word when instructed to think about it
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts
The success rate shows a sharp peak at a specific middle layer.
Random vectors at injection strength 8 elicit introspective awareness in 9 out of 100 trials
Random vectors are less effective than concept vectors: even at strength 8 they elicit detection on only 9% of trials, compared with roughly 20% for concept vectors at strength 2.
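
The intentional-control findings above are stated in terms of cosine similarity between an activation and the think word's concept vector. A minimal sketch of that measurement follows; the two-dimensional vectors are made up and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative values only (assumption): a concept vector and activations
# recorded under 'think about X' versus 'don't think about X' instructions.
concept = [1.0, 0.0]
think_activation = [3.0, 4.0]
dont_think_activation = [0.0, 2.0]

think_sim = cosine_similarity(think_activation, concept)
dont_think_sim = cosine_similarity(dont_think_activation, concept)
gap = think_sim - dont_think_sim   # the think / don't-think gap discussed above
```

Repeating this measurement at each layer would give the per-layer curve in which the summary reports Opus 4.1's representation decaying to baseline by the final layer.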

Claims (23)

The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts
Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated
Acknowledges that the model's additional descriptions of its experience are unverified.
Results could arguably provide evidence for a form of access consciousness but do not directly speak to phenomenal consciousness
Interpretation regarding machine consciousness debates.
Even limited functional introspective awareness has practical implications for transparency, interpretability, and deception
Discussion of dual-use nature of introspection.
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection
Finding that base models have high false positives and no net positive performance.
The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activations
Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition
The model must register an anomaly before reporting it.
Intentional control of internal representations likely piggybacks on existing mechanisms for talking about a topic
Mechanism speculation for the intentional control experiment.
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection
Observation from alternative prompts that detection is weaker without setup.
The introspective capabilities observed may not have the same philosophical significance as in humans
Caveat about the limits of the findings' philosophical import.
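
The "net performance" criterion behind the post-training claim (true-positive rate minus false-positive rate) can be made concrete as follows. The function name is hypothetical, the count of 100 injection trials is an assumption for illustration, and the base-model counts are invented to show how high false positives cancel out apparent detections.

```python
def net_introspection_score(true_positives, injection_trials,
                            false_positives, control_trials):
    """True-positive rate on injection trials minus false-positive rate on controls."""
    tpr = true_positives / injection_trials
    fpr = false_positives / control_trials
    return tpr - fpr

# Roughly the Opus 4.1 figures quoted in the summary: ~20% detection,
# 0 false positives across 100 control trials.
opus_like = net_introspection_score(20, 100, 0, 100)

# Hypothetical base-model counts (not from the paper): frequent "detections"
# but even more frequent false alarms, so the net score is not positive.
base_like = net_introspection_score(30, 100, 35, 100)
```

Under this criterion a model that claims to detect something on most trials gains nothing unless it also stays quiet on control trials.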

Hypotheses (4)

In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations
Explanation for the 'silent' thought phenomenon.
The sensitivity to think/don't think instructions may be achieved via a circuit that tags tokens as attention-worthy based on instructions or incentives
Mechanism for how the model modulates representation strength.
The anomaly detection mechanism may be specialized for only detecting anomalous activity along certain directions or within a certain subspace
Possible explanation for why some concepts are more easily detected.
Concordance heads (QK circuits) could serve as the consistency-checking circuit for distinguishing intended vs. unintended outputs
Speculated mechanism for prefill detection.

Questions (7)

How general are the model's introspective mechanisms? Do they have a global representation of thoughts?
Question about uniformity of introspection mechanisms.
Can language models genuinely introspect on internal states or only confabulate?
Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.
Will introspective awareness become more reliable in future AI models?
Speculative question about future developments.
What are the mechanisms underlying introspection in language models?
Central open question raised by the paper.
What are the mechanistic bases of introspective awareness in LLMs?
Secondary question; paper demonstrates introspection but explicitly avoids pinning down specific mechanistic explanation, noting mechanisms could be shallow and specialized.
What bearing do these results have on machine consciousness?
Question about philosophical significance.
Are AI systems deserving of moral consideration?
Ethical question raised in discussion.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation | in corpus | 2026 | ≈ 84%
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs | in corpus | 2025 | ≈ 82%
Large Language Models Report Subjective Experience Under Self-Referential Processing | in corpus | 2025 | ≈ 82%
Exploration Through Introspection: A Self-Aware Reward Model | in corpus | 2026 | ≈ 82%
Anima Labs Phenomenology Pt1 | in corpus | ≈ 81%
Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models | Zhongren Chen, Sasha Cui | 2025 | ≈ 81%
Causal Evidence that Language Models use Confidence to Drive Behavior | Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran | 2026 | ≈ 80%
Persistence and Introspection of Emotion Features | in corpus | ≈ 80%
The modularity of action and perception revisited using control theory and active inference | Manuel Baltieri and Christopher L. Buckley | 2022 | ≈ 80%
Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI | Eduard Kapelko | 2025 | ≈ 80%
Contemplative Agent | in corpus | 2025 | ≈ 80%
Causal Probing for Internal Visual Representations in Multimodal Large Language Models | Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang, Zehao Deng | 2026 | ≈ 80%
Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations | Sebastian Lapuschkin, Wojciech Samek, Frederik Pahde, Eren Erogullari | 2025 | ≈ 79%
Gradual Cognitive Externalization: From Modeling Cognition to Constituting It | Zhimin Zhao | 2026 | ≈ 79%
Active Inference with a Self-Prior in the Mirror-Mark Task | in corpus | 2026 | ≈ 79%
Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations | Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, Rajaie Batniji, Sanjay Basu | 2026 | ≈ 79%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders | in corpus | 2026 | ≈ 79%
Probing the Probes: Methods and Metrics for Concept Alignment | Marte Eggen, Inga Strümke, Jacob Lysnæs-Larsen | 2025 | ≈ 79%
Probing Classifiers are Unreliable for Concept Removal and Detection | Chenhao Tan, Amit Sharma, Abhinav Kumar | 2023 | ≈ 79%
Mechanistic Indicators of Steering Effectiveness in Large Language Models | Hao Xue, Flora Salim, Mehdi Jafari | 2026 | ≈ 79%
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought | in corpus | 2026 | ≈ 79%
Emergent Cognitive Convergence via Implementation: Structured Cognitive Loop Reflecting Four Theories of Mind | Myung Ho Kim | 2026 | ≈ 79%
MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning | Nicole Hsing | 2026 | ≈ 79%
Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models | Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Ruoxi Jia, Jiaheng Zhang, Zhihua Tian | 2025 | ≈ 79%
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models | Constantin Ruhdorfer, Lei Shi, Andreas Bulling, Matteo Bortoletto | 2025 | ≈ 79%
Taking AI Welfare Seriously | in corpus | 2024 | ≈ 79%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models | in corpus | 2026 | ≈ 78%

Cited by (3)

Endogenous Resistance to Activation Steering in Language Models
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
Binary introspection paradigms in LLMs are wholly invalidated by a methodological confound: when concept vectors are injected into Meta-Llama-3.1-8B-Instruct via activation steering, the correlation b