finding

active

finding:prefill-detection-effect-peaks-at-an-earlier-layer-slightly-over-halfway-through-in-opus-4-1-different-from-injected-thoughts-peak

Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak

The optimal layer for prefill detection differs from the optimal layer for detecting injected thoughts.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Different forms of introspection invoke mechanistically different processes
supports
Based on layer-selective perturbation results.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.821
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts · finding · 0.817
The success rate shows a sharp peak at a specific middle layer.
In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations · hypothesis · 0.810
Explanation for the 'silent' thought phenomenon.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.808
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.807
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.798
Lindsey (2026) differential layer performance explained by Janus's path combinatorics — different tasks use different path distributions.
Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.797
Production Opus 4.1/4 never falsely claim an injected thought when none is present.
The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activations · claim · 0.776
Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
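
The "related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. As a minimal sketch of how such a list could be produced (this is not the page's actual implementation): the function names, the embedding dictionary, and the representation of typed edges as a set of ID pairs are all hypothetical, and only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest first.

    embeddings:  dict mapping entity ID -> embedding vector (hypothetical store)
    typed_edges: set of (id_a, id_b) pairs that already have a typed relation
                 (hypothetical representation; direction is ignored here)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target by a typed edge.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by a filter like this are only candidates: a high score suggests either a missing typed edge or a duplicate entry, and deciding which still requires reading both entries.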