hypothesis
active
hypothesis:in-opus-4-1-the-think-word-representation-decays-to-baseline-in-the-final-layer-because-the-strong-next-token-prediction-drowns-out-other-representations
In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations
Explanation for the 'silent' thought phenomenon.
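The arithmetic behind the hypothesis can be illustrated with a toy sketch. It is not the source paper's measurement: the vectors, layer names, and strengths below are all made-up assumptions, chosen only to show how a cosine similarity to a concept vector can fall toward zero when a second direction grows, even though the concept component itself is unchanged.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical orthogonal unit directions in a toy 3-dimensional space.
concept = [1.0, 0.0, 0.0]     # stand-in for the think word's concept vector
next_token = [0.0, 1.0, 0.0]  # stand-in for the next-token prediction direction

def layer_state(concept_strength, next_token_strength):
    """Toy hidden state: a weighted sum of the two directions."""
    return [
        concept_strength * c + next_token_strength * n
        for c, n in zip(concept, next_token)
    ]

# Made-up strengths: the concept component is identical in both layers,
# and only the next-token component grows in the final layer.
layers = {"middle": (1.0, 1.0), "final": (1.0, 20.0)}

similarity = {
    name: cosine(layer_state(c, n), concept)
    for name, (c, n) in layers.items()
}

for name, value in similarity.items():
    print(f"{name}: cosine to concept vector = {value:.3f}")
```

In this toy setup the baseline is zero because the two directions are orthogonal; in a real model the baseline would have to be estimated from control conditions.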
Source paper
extracted_from · Lindsey, Jack (2026)
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Suggests that later models can keep the thought 'silent' rather than letting it influence output.
- Cited to support the distinction between enacted and described reflection; capable models show silent mid-layer processing.
- The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
- NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.796 · Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
- All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.792 · In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded. · finding · 0.786 · Shows NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.