finding

active

finding:in-opus-4-1-representation-of-the-think-word-decays-to-baseline-by-the-final-layer-unlike-claude-3-models-where-it-persists

In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists

Suggests that later models can keep the thought 'silent' rather than letting it influence output.
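
A minimal sketch of how a "decays to baseline" pattern could be checked, assuming the representation is scored per layer as cosine similarity between that layer's activation and the think word's concept vector. The function names, the toy 2-dimensional activations, the baseline of 0.0, and the tolerance of 0.05 are all illustrative assumptions, not values or code from the source paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def layer_profile(activations, concept_vector):
    """One similarity score per layer, ordered from first layer to last."""
    return [cosine(layer, concept_vector) for layer in activations]

def decays_to_baseline(profile, baseline=0.0, tol=0.05):
    """True if some layer rises above baseline but the final layer is back within tol of it."""
    rises = max(profile) > baseline + tol
    final_at_baseline = abs(profile[-1] - baseline) <= tol
    return rises and final_at_baseline

# Toy data (hypothetical): four layers, 2-d activations.
concept = [1.0, 0.0]
silent = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]      # peaks mid-stack, gone at the end
persistent = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.2]]  # still present at the final layer

print(decays_to_baseline(layer_profile(silent, concept)))      # silent pattern
print(decays_to_baseline(layer_profile(persistent, concept)))  # persistent pattern
```

In this toy setup the first profile corresponds to the pattern reported for Opus 4.1 and the second to the pattern reported for Claude 3 models; in practice the baseline would be estimated from a control condition rather than fixed at zero.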

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations · hypothesis · 0.932
Explanation for the 'silent' thought phenomenon.
Lindsey: Opus 4/4.1 show concept representations in middle layers that decay to baseline by final layer ('silent' internal process) · finding · 0.868
Cited to support enacted vs described reflection distinction; capable models show silent mid-layer processing
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.820
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.819
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
Opus 4.6 represented target language internally before switching languages, with persistent Russian representations appearing before plausible textual cues · finding · 0.815
NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.812
Key finding about the relationship between capability and introspection.
Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · claim · 0.810
Based on consistent best performance across experiments.
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1 than the peak for injected thoughts · finding · 0.807
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.