finding
active
finding:introspective-awareness-peaks-at-a-layer-about-two-thirds-through-opus-4-1-for-injected-thoughts
Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts
The success rate for detecting injected thoughts peaks sharply at a single layer about two-thirds of the way through the model.
Source paper
extracted_from · (2026) · Lindsey, Jack
Neighborhood — ranked by edge-count
Claims (1)
claim
- Based on layer-selective perturbation results.
Communities (4)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Investigates how different introspective processes activate distinct computational mechanisms at specific model depths, using layer-wise analysis.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · claim · 0.825 · Based on consistent best performance across experiments.
- The optimal layer for prefill introspection differs from the optimal layer for detecting injected thoughts.
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.801 · In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.799 · Differential layer performance in Lindsey (2026) is explained by Janus's path combinatorics: different tasks use different path distributions.
- Key quantitative characterization of the layer-dependence of partial introspection
- Introspective signals appear in middle layers but are suppressed by later post-training-shaped layers. · finding · 0.793 · Mechanistic finding by Lindsey (2026) explaining how a contemplative prompt may work: it enables mid-layer introspection to reach the output.
- Key finding about the relationship between capability and introspection.
- Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.783 · Production Opus 4.1/4 never falsely claim an injected thought when none is present.
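
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge. A minimal sketch of how such a list could be produced follows. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity ids. The function names, entity ids, and toy vectors are hypothetical and are not taken from the actual graph.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs similar to `target` that have no typed edge to it.

    Results are sorted by descending similarity. `typed_edges` is treated as
    undirected: a pair in either order counts as an existing relation.
    """
    results = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue  # already related by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((other, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: ids and 2-D vectors are illustrative only.
embeddings = {
    "peak_layer": [1.0, 0.0],
    "detection_rate": [0.9, 0.1],
    "zero_false_positives": [0.8, 0.6],
    "source_paper": [1.0, 0.05],
    "unrelated": [0.0, 1.0],
}
typed_edges = {("peak_layer", "source_paper")}  # e.g. an extracted_from edge

candidates = untyped_neighbors("peak_layer", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name}\t{score:.3f}")
```

In this toy example, `source_paper` is excluded despite its high similarity because it already has a typed edge, and `unrelated` falls below the threshold. The remaining entries are the candidates for new edges or duplicate checks.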