hypothesis

active

hypothesis:in-opus-4-1-the-think-word-representation-decays-to-baseline-in-the-final-layer-because-the-strong-next-token-prediction-drowns-out-other-representations

In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations

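Proposed explanation for the 'silent' thought phenomenon, in which the think word is represented in earlier layers but no longer in the final layer.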

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.932
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Lindsey: Opus 4/4.1 show concept representations in middle layers that decay to baseline by final layer ('silent' internal process) · finding · 0.851
Cited to support enacted vs. described reflection distinction; capable models show silent mid-layer processing.
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1 than the injected-thought detection effect · finding · 0.810
The optimal layer for prefill introspection differs from the optimal layer for detecting injected thoughts.
Opus 4.6 represented target language internally before switching languages, with persistent Russian representations appearing before plausible textual cues · finding · 0.805
NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.796
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Opus 4.6 ignored incorrect tool output and reported the precomputed correct answer instead, demonstrating unverbalized reasoning · finding · 0.793
Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.792
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded · finding · 0.786
Shows NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.