finding
active
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak
The optimal layer for prefill detection differs from the optimal layer for detecting injected thoughts.
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Based on layer-selective perturbation results.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.821 · Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts · finding · 0.817 · The success rate shows a sharp peak at a specific middle layer.
- Explanation for the 'silent' thought phenomenon.
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.808 · In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- Suggests that later models can keep the thought 'silent' rather than letting it influence output.
- Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.798 · Lindsey (2026) differential layer performance explained by Janus's path combinatorics: different tasks use different path distributions.
- Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.797 · Production Opus 4.1/4 never falsely claim an injected thought when none is present.
- Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
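
The selection rule behind the "Related by similarity" list (cosine ≥ 0.65, no typed edge) could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are all hypothetical, and the page does not say how its similarity scores are actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """List (entity, score) pairs similar to `target` that lack a typed edge to it.

    `embeddings` maps entity name -> vector (hypothetical storage format).
    `typed_edges` is a set of frozenset pairs that already have a typed relation.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # skip the entity itself
        if frozenset((target, name)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
embeddings = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],  # similar to "a", no typed edge
    "c": [0.0, 1.0],  # orthogonal to "a", below threshold
    "d": [1.0, 0.1],  # very similar to "a", but already has a typed edge
}
typed_edges = {frozenset(("a", "d"))}

print(related_without_edge("a", embeddings, typed_edges))
```

Entities returned by such a filter are the ones worth reviewing by hand, either to add a typed edge or to merge as duplicates.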