finding

active

finding:introspective-awareness-peaks-at-a-layer-about-two-thirds-through-opus-4-1-for-injected-thoughts

Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts

In the injected-thoughts experiment, the detection success rate for Opus 4.1 peaks sharply at one specific layer, roughly two-thirds of the way through the model.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Different forms of introspection invoke mechanistically different processes
supports
Based on layer-selective perturbation results.

Communities (4)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic introspection in language models
members_of
Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Mechanistic introspection in language models
members_of
Investigates how different introspective processes activate distinct computational mechanisms at specific model depths, using layer-wise analysis.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · claim · 0.825
Based on consistent best performance across experiments.
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak · finding · 0.817
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.801
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.799
The differential layer performance reported by Lindsey (2026) is explained by Janus's path combinatorics: different tasks use different path distributions.
Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter · claim · 0.794
Key quantitative characterization of the layer-dependence of partial introspection.
Introspective signals appear in middle layers but are suppressed by later post-training-shaped layers. · finding · 0.793
Mechanistic finding by Lindsey (2026) that may explain how the contemplative prompt works: it lets mid-layer introspection reach the output.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.784
Key finding about the relationship between capability and introspection.
Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.783
Production Opus 4.1/4 never falsely claim an injected thought when none is present.
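
The selection rule for the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the entity ids, the toy two-dimensional embeddings, the edge representation as undirected pairs, and the function names are hypothetical, and this is not the actual implementation behind this page.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that share
    no typed edge with the target, sorted by descending score."""
    target_vec = embeddings[target_id]
    # Entities already connected to the target by any typed edge are excluded.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates

# Hypothetical toy data: ids and vectors are invented for this example.
embeddings = {
    "this-finding": [1.0, 0.0],
    "near-unlinked": [0.9, 0.1],    # similar, no typed edge: kept
    "near-linked": [1.0, 0.05],     # similar, but already has a typed edge: dropped
    "far-unlinked": [0.0, 1.0],     # below the threshold: dropped
}
typed_edges = {("this-finding", "near-linked")}

related = related_by_similarity("this-finding", embeddings, typed_edges)
```

With this toy data only `near-unlinked` survives both filters, which mirrors the intent stated above: the list surfaces candidates for new edges or unrecognized duplicates rather than entities that are already linked.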