finding

active

finding:claude-opus-4-1-and-4-detect-injected-thoughts-on-20-of-trials-at-optimal-layer-and-injection-strength-2

Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2

In the injected thoughts experiment, Opus 4.1 detects the injected thought on about 20% of trials at the optimal layer and injection strength.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Modern language models possess at least a limited, functional form of introspective awareness
supports
The paper's central interpretive assertion.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.860
Production Opus 4.1/4 never falsely claim an injected thought when none is present.
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.831
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · claim · 0.825
Based on consistent best performance across experiments.
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak · finding · 0.808
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts · finding · 0.801
The success rate shows a sharp peak at a specific middle layer.
Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trials · finding · 0.797
Outlier result for Claude 4 Opus suggesting different baseline behavior from other models.
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.789
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Prompt variant detection rate 18% (9 out of 50 trials) for Opus 4.1 · finding · 0.788
On a variant of the injected thoughts prompt allowing the model to mention a concept regardless, detection rate was 18%.