finding
active
finding:lindsey-opus-4-4-1-show-concept-representations-in-middle-layers-that-decay-to-baseline-by-final-layer-silent-internal-processLindsey: Opus 4/4.1 show concept representations in middle layers that decay to baseline by final layer ('silent' internal process)
Cited to support enacted vs described reflection distinction; capable models show silent mid-layer processing
Source paper
extracted_from(2026) · Borzov, Anton
Neighborhood — ranked by edge-count
Claims (1)
claim
- Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Suggests that later models can keep the thought 'silent' rather than letting it influence output.
- Explanation for the 'silent' thought phenomenon.
- NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection taskfinding0.785Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested modelsclaim0.783Based on consistent best performance across experiments.
- Claude Opus 4.6 represents a plan to end a couplet with 'rabbit' before outputting the rhyming line.finding0.774Demonstrates causal relationship between NLA explanations and model outputs via steering with edited explanations.
- Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded.finding0.772Shows NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.
- Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughtsfinding0.768The success rate shows a sharp peak at a specific middle layer.