claim

active

claim:modern-language-models-possess-at-least-a-limited-functional-form-of-introspective-awareness

Modern language models possess at least a limited, functional form of introspective awareness

The paper's central interpretive assertion.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Findings (5)

finding

All models exhibit above-baseline representation of the think word when instructed to think about it
supports
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
All models performed substantially above chance (10%) on distinguishing injected thought from text input
supports
All tested models could both identify the injected concept and transcribe the input sentence at rates well above chance.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2
supports
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Concept injection at strength 2 does not increase affirmative responses on unrelated yes/no questions
supports
Control experiment rules out the possibility that concept vectors simply bias the model to answer affirmatively.
Random vectors at injection strength 8 elicit introspective awareness in 9 out of 100 trials
supports
Random vectors are less effective than concept vectors: even at the higher injection strength of 8, they elicit introspection at a lower rate.
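
The findings above refer to injecting a concept vector at a given strength and to measuring cosine similarity against a concept vector. A minimal sketch of those two operations follows, assuming injection means adding a scaled vector to an activation. The function names, the additive form, and the toy 3-dimensional values are illustrative assumptions, not the paper's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def inject(activation, concept_vector, strength):
    """Hypothetical injection step: add the concept vector scaled by `strength`."""
    return [h + strength * c for h, c in zip(activation, concept_vector)]


# Toy values invented for illustration; real activations have thousands of dimensions.
activation = [1.0, 0.0, 0.0]
concept_vector = [0.0, 1.0, 0.0]

before = cosine_similarity(activation, concept_vector)
after = cosine_similarity(inject(activation, concept_vector, 2.0), concept_vector)
```

In this toy setup the activation starts orthogonal to the concept vector, so any similarity measured afterwards comes from the injected component alone.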

Communities (4)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic introspection in language models
members_of
Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
LLM functional introspective awareness
members_of
Empirical probing of language models' ability to detect and report their own internal concept representations
Introspective awareness activation in language models
members_of
Studying how concept injection and random vectors trigger self-reflective capabilities in LLMs across varying strength parameters.

Concepts (1)

concept

Introspective awareness
about
The central concept: the ability of a model to access and report on its internal states, as defined by the paper's criteria.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. · quote · 0.965
Abstract's main conclusion.
Emergent Introspective Awareness in Large Language Models (Lindsey, 2025) · concept · 0.864
Related work demonstrating LLM introspective capabilities with scale-dependent pattern paralleling ESR
Can language models genuinely introspect on internal states or only confabulate? · question · 0.832
Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.
Introspective awareness correlates with overall model capability · claim · 0.830
Most capable models (Opus 4, 4.1) show greatest introspective awareness; trend suggests introspection aided by improvements in model intelligence.
What are the mechanisms underlying introspection in language models? · question · 0.828
Central open question raised by the paper.
Will introspective awareness become more reliable in future AI models? · question · 0.826
Speculative question about future developments.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.825
Key finding about the relationship between capability and introspection.
Introspective capabilities may continue to develop with further improvements to model capabilities · claim · 0.823
Forward-looking statement about future models.
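
The selection rule stated for this section (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, assuming each candidate carries a label, an entity type, and a precomputed cosine score. The `Neighbor` class and `related_by_similarity` function are hypothetical names, and the third and fourth candidates use invented scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Neighbor:
    label: str
    kind: str      # entity type, e.g. "claim", "question", "quote"
    cosine: float  # similarity to the entity this page describes


def related_by_similarity(candidates, typed_edge_labels, threshold=0.65):
    """Keep candidates at or above the threshold that lack a typed edge, best first."""
    kept = [
        c for c in candidates
        if c.cosine >= threshold and c.label not in typed_edge_labels
    ]
    return sorted(kept, key=lambda c: c.cosine, reverse=True)


candidates = [
    Neighbor("What are the mechanisms underlying introspection in language models?", "question", 0.828),
    Neighbor("Introspective awareness correlates with overall model capability", "claim", 0.830),
    Neighbor("Introspective awareness", "concept", 0.900),  # invented score; has a typed edge
    Neighbor("Unrelated entity", "claim", 0.400),           # invented; below the threshold
]
typed_edge_labels = {"Introspective awareness"}

result = related_by_similarity(candidates, typed_edge_labels)
```

Entities that pass the filter are the ones worth reviewing as possible new edges or duplicates. Those already linked by a typed edge are excluded however similar they are.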