finding
active
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
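To make the measurement behind this finding concrete, here is a minimal sketch of a layer-wise comparison between activations and a concept vector using cosine similarity. It is not the source paper's code. The function names, the toy dimensions, the hand-picked `strength` profile, and the zero baseline are all assumptions made for illustration, and the activations are synthetic rather than taken from any model.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def layerwise_similarity(activations, concept_vector):
    """Cosine similarity to the concept vector at each layer.

    activations: array of shape (n_layers, d_model), one vector per layer.
    """
    return [cosine(h, concept_vector) for h in activations]


# Synthetic stand-ins (hypothetical, not real model activations).
rng = np.random.default_rng(0)
n_layers, d_model = 6, 64
concept = rng.normal(size=d_model)

# Background activity, with any component along the concept projected out,
# so the baseline similarity is zero by construction in this toy setup.
noise = rng.normal(size=(n_layers, d_model))
noise -= np.outer(noise @ concept, concept) / (concept @ concept)

# Hand-picked profile: the concept is absent at the first layer, strongest
# mid-stack, and absent again at the final layer (the "decay to baseline"
# pattern described above).
strength = np.array([0.0, 1.0, 3.0, 3.0, 1.0, 0.0])
activations = noise + strength[:, None] * concept

sims = layerwise_similarity(activations, concept)
for layer, s in enumerate(sims):
    print(f"layer {layer}: cosine similarity = {s:+.3f}")
```

Under these assumptions, a model in which the representation persists would correspond to a `strength` profile that stays above zero at the last layer. The "silent" pattern is the one where it returns to zero before the output.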
Source paper
extracted_from: (2026) · Lindsey, Jack
Neighborhood — ranked by edge-count
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Explanation for the 'silent' thought phenomenon.
- Cited to support enacted vs described reflection distinction; capable models show silent mid-layer processing
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task (finding, 0.820): Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- All models exhibit above-baseline representation of the think word when instructed to think about it (finding, 0.819): In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
- Key finding about the relationship between capability and introspection.
- Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models (claim, 0.810): Based on consistent best performance across experiments.
- The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.