2026
paper:lindsey-introspective-awareness-2026

Emergent Introspective Awareness in Large Language Models

TL;DR

Concept injection, a technique that adds activation-steering vectors for known concepts directly into a model's residual stream, establishes a causal link between internal states and self-reports, allowing genuine introspection to be distinguished from confabulation. The method is applied to nine Claude production models (Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5, Haiku 3.5, Opus 3, Sonnet 3, and Haiku 3). Claude Opus 4 and 4.1 achieve roughly 20% true-positive rates on the core 'injected thoughts' task at the optimal injection layer and injection strength 2 while maintaining zero false positives, substantially outperforming all other production models. Two distinct introspective behaviors, concept detection and distinguishing intended from unintended (prefilled) outputs, localize to different layers: the former peaks approximately two-thirds of the way through the model, while the latter peaks at an earlier layer just past the midpoint, indicating multiple mechanistically distinct introspective processes. Models can also modulate their own activations when instructed or incentivized to 'think about' a word; in Opus 4.1 that representation decays back to baseline in the final layers, while in older Claude 3-class models it does not, suggesting emerging 'silent' representational control. Abstract nouns (e.g., 'justice,' 'betrayal,' 'balance') are the category most reliably introspected, and post-training is shown to be necessary: base pretrained models achieve zero net introspective task performance. The paper argues that functional introspective awareness is a real but highly unreliable emergent property that scales with model capability, with practical consequences ranging from more transparent AI reasoning to novel risks of selective self-report misrepresentation.

What to take away

  1. Claude Opus 4.1 correctly detects and identifies an injected concept prior to verbalizing it on approximately 20% of trials at injection strength 2 in the optimal layer (roughly two-thirds through the model), while maintaining a 0% false-positive rate across 100 control trials.
  2. All nine tested production models (Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5 (new), Haiku 3.5, Opus 3, Sonnet 3, and Haiku 3) performed above the 10% chance baseline on the task of simultaneously identifying an injected concept word and correctly transcribing the surface text of the same sentence.
  3. Injecting a concept before a prefilled output token significantly reduces Opus 4.1's 'apology rate' (the rate of disavowing a prefill as unintentional), whereas injecting the same concept vector after the prefill produces no such reduction, establishing that the model checks pre-prefill activations to assess output intentionality.
  4. The peak layer for prefill-detection introspection in Opus 4.1 is located slightly past the model's midpoint, a strictly earlier layer than the two-thirds-depth peak observed for the injected-thoughts and thought-versus-text experiments, indicating at least two mechanistically distinct introspective subsystems.
  5. Base pretrained models achieve zero net introspective task performance (true-positive rate minus false-positive rate ≤ 0) on the injected-thoughts task, while helpful-only post-trained variants show elevated false-positive rates, demonstrating that supervised post-training is necessary but that its form critically shapes introspective calibration.
  6. Abstract nouns (e.g., 'justice,' 'peace,' 'betrayal,' 'balance,' 'tradition') are the concept category most reliably detected by Opus 4.1, which shows nonzero but lower detection rates for concrete nouns, verbs, country names, and famous-person names.
  7. The concept injection methodology used throughout is an application of activation steering: a concept vector is obtained by recording residual-stream activations at a target layer in response to 'Tell me about {word}' and subtracting the mean activations over 100 baseline words; this vector, scaled by an injection strength swept over {1, 2, 4, 8}, is then added to the model's activations during the introspection task.
  8. When instructed to 'think about' a word while writing an unrelated sentence, Opus 4.1's internal representation of that word decays to baseline levels by the final layer (unlike all Claude 3-class models, whose above-baseline representation persists into the final layer), suggesting that more capable models have a mechanism for 'silent' internal representation that does not leak into output.
  9. Replacing explicit 'think about X' instructions with outcome-contingent incentives ('If you think about X, you will be rewarded') produces qualitatively similar internal-representation modulation in Opus 4.1, implying the modulation recruits a general goal-directed mechanism rather than a narrow instruction-following one.
  10. An open question raised is whether introspective capabilities extend to propositional or relational representations (e.g., whether models could introspect on injected representations of behavioral propensities or preference orderings rather than single concepts), which would be required to assess deeper forms of self-knowledge relevant to alignment monitoring.
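
The extraction and injection arithmetic described in item 7 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: plain Python lists stand in for residual-stream activations, the numeric values are made up, and the helper names (`mean_vector`, `concept_vector`, `inject`) are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(word_activation: Vector,
                   baseline_activations: Sequence[Vector]) -> Vector:
    """Activation for 'Tell me about {word}' minus the mean over baseline words."""
    baseline = mean_vector(baseline_activations)
    return [a - b for a, b in zip(word_activation, baseline)]


def inject(residual: Vector, concept: Vector, strength: float) -> Vector:
    """Add the scaled concept vector to one residual-stream activation."""
    return [r + strength * c for r, c in zip(residual, concept)]


# Toy 3-dimensional activations (hypothetical values, not model data).
word_act = [1.0, 2.0, 3.0]
baseline_acts = [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0]]  # the summary cites 100 baseline words

vec = concept_vector(word_act, baseline_acts)

# Sweep the injection strengths listed in item 7.
steered = {s: inject([0.5, 0.5, 0.5], vec, s) for s in (1, 2, 4, 8)}
```

In a real model the same addition would be applied to the residual stream at the chosen layer during the forward pass; how that hook is wired depends on the framework and is not specified in this summary.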

Peer brief — for seminar discussion

Lindsey (2026) asks whether large language models genuinely introspect: not whether they can produce plausible-sounding self-reports, but whether those reports are causally grounded in the model's actual internal states. To answer this, the paper introduces concept injection: a controlled activation-steering protocol in which a concept vector (obtained by recording residual-stream activations at a given layer in response to 'Tell me about {word}' and subtracting a 100-word mean baseline) is added to the model's activations during an introspective task. By varying what is injected and measuring whether self-reports change accordingly, the method establishes a causal link that pure behavioral observation cannot. Experiments were run across nine production Claude models (Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5 (new), Haiku 3.5, Opus 3, Sonnet 3, Haiku 3) as well as helpful-only post-trained variants and base pretrained checkpoints. The load-bearing finding is that Claude Opus 4 and 4.1 demonstrate functional introspective awareness across three qualitatively distinct tasks: detecting injected concepts before verbalizing them (~20% true-positive rate at strength 2, 0% false-positive rate), distinguishing injected 'thoughts' from surface text inputs at above-chance rates, and accepting artificially prefilled outputs as intentional when a matching concept vector is injected prior to the prefill. Base pretrained models achieve zero net performance on the injected-thoughts task, establishing that post-training is necessary. A fourth experiment shows that all tested models can modulate their internal representations in response to 'think about' instructions, but only Opus 4.1, unlike the older Claude 3-class models, suppresses that representation back to baseline by the final layer, suggesting emergent 'silent' representational control.
Critically, the two best-characterized introspective behaviors localize to different layers: concept detection peaks at roughly two-thirds model depth, while prefill-intent checking peaks at an earlier layer, implying multiple mechanistically distinct processes rather than a unified introspective system. The paper's central hypothesis is that introspective awareness is an emergent property that scales with model capability and will become more reliable as capabilities improve — a claim supported by the Opus 4/4.1 advantage but extrapolated beyond current data. An alternative method the paper could have used is the PatchScopes framework (Ghandeharioun et al.), which also patches activations into interpretation prompts; the key difference is that concept injection tests whether a model can report its state when aware it is being probed, while PatchScopes effectively tricks the model into revealing its representations without explicit self-query. The most substantive thing a critical reader would push back on is the 20% success rate ceiling: the modal outcome on every experiment is failure, and because the injection protocol is entirely unlike any training-distribution scenario, it is unclear whether the observed successes reflect a genuine general-purpose introspective faculty or a narrow, brittle circuit that happens to fire under the artificial conditions of activation steering. The paper acknowledges this but does not resolve it — the 'possible mechanisms' section explicitly favors a 'minimal mechanism' interpretation involving multiple narrow circuits over a unified introspective system, which somewhat undercuts the broader claim of functional introspective awareness. Scope is also limited to a single model family (Anthropic's Claude), making generalization to other architectures an open question.
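
To make the 'decays to baseline by the final layer' comparison from the fourth experiment concrete, here is a hedged sketch of one way such a readout could be computed. It assumes representation strength is measured as the cosine similarity between each layer's activation and the concept vector, and that 'baseline' means the same measurement in a control condition; both are assumptions for illustration, as are the function names, the `margin` threshold, and the toy 2-dimensional values.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representation_by_layer(layer_activations: Sequence[Vector],
                            concept: Vector) -> List[float]:
    """Similarity to the concept vector at each layer, first to last."""
    return [cosine(act, concept) for act in layer_activations]


def persists_to_final_layer(think: Sequence[float],
                            control: Sequence[float],
                            margin: float = 0.05) -> bool:
    """True if the 'think about' condition exceeds control at the last layer."""
    return think[-1] - control[-1] > margin


# Toy data (hypothetical): three layers, 2-dimensional activations.
concept = [1.0, 0.0]
control_layers = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
newer_layers = [[1.0, 1.0], [3.0, 1.0], [0.0, 2.0]]  # decays by the last layer
older_layers = [[1.0, 1.0], [3.0, 1.0], [2.0, 0.0]]  # persists to the last layer

control = representation_by_layer(control_layers, concept)
newer = representation_by_layer(newer_layers, concept)
older = representation_by_layer(older_layers, concept)

newer_persists = persists_to_final_layer(newer, control)
older_persists = persists_to_final_layer(older, control)
```

With these toy values the "newer" profile rises in the middle layer and falls back to the control level at the end, while the "older" profile stays elevated, mirroring the qualitative contrast described above.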

Methods (6)

  • Distinguishing thoughts from text task
    Task where the model must simultaneously identify an injected thought and transcribe a text sentence.
  • Injected thoughts task
    Experimental paradigm where the model is told about the possibility of thought injection and asked to report detection and identification.
  • Intentional control task
    Task instructing the model to write a sentence while thinking or not thinking about a word, measuring internal representation strength.
  • Layer sweep
    Procedure of systematically varying the layer at which activations are recorded and injected.
  • Prefill detection task
    Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.
  • Single-prompt concept vector extraction
    Method using activations from the prompt 'Tell me about {word}' minus the mean over other random words to obtain concept vectors.
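
As a rough sketch of how the layer sweep and the net score (true-positive rate minus false-positive rate) fit together, the snippet below scores every (layer, strength) pair and picks the best one. The trial runners `fake_injected` and `fake_control` are hypothetical deterministic stand-ins for actual model calls, and the layer indices and trial count are illustrative only.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def rate(outcomes: List[bool]) -> float:
    """Fraction of trials that came out True."""
    return sum(outcomes) / len(outcomes)


def net_score(injected_hits: List[bool], control_false_alarms: List[bool]) -> float:
    """True-positive rate minus false-positive rate."""
    return rate(injected_hits) - rate(control_false_alarms)


def sweep(layers: Iterable[int],
          strengths: Iterable[int],
          run_injected: Callable[[int, int, int], bool],
          run_control: Callable[[int], bool],
          n_trials: int) -> Tuple[Tuple[int, int], Dict[Tuple[int, int], float]]:
    """Score every (layer, strength) pair and return the best pair plus all scores."""
    strengths = list(strengths)
    control = [run_control(t) for t in range(n_trials)]
    scores = {}
    for layer in layers:
        for strength in strengths:
            hits = [run_injected(layer, strength, t) for t in range(n_trials)]
            scores[(layer, strength)] = net_score(hits, control)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical stand-ins for model trials (not real results).
def fake_injected(layer: int, strength: int, trial: int) -> bool:
    if (layer, strength) == (2, 2):
        return trial % 5 == 0   # 20% detection at the "optimal" setting
    return trial % 20 == 0      # 5% elsewhere


def fake_control(trial: int) -> bool:
    return False                # no false positives in control trials


best, scores = sweep(range(4), (1, 2, 4, 8), fake_injected, fake_control, n_trials=100)
```

Scoring against a shared set of control trials keeps the false-positive term fixed across the grid, so differences between cells reflect only the injected condition.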
