finding

active

finding:opus-4-1-and-4-exhibit-zero-false-positives-on-injected-thoughts-task-0-over-100-trials

Opus 4.1 and 4 exhibit zero false positives on the injected thoughts task (0 of 100 trials)

When no thought is injected, production Opus 4.1 and 4 never claim to detect one (0 false positives in 100 trials).
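
A count of 0 out of 100 bounds the underlying false positive rate but does not show it to be exactly zero. The sketch below computes the exact one-sided upper confidence bound for a rate when no events are observed. It rests on assumptions not stated in this record: that the trials are independent with a constant false positive rate, and that a 95% confidence level is the level of interest. The function name `zero_event_upper_bound` is hypothetical and used only for illustration.

```python
def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate given 0 events in n_trials.

    Assumes independent trials with a constant event probability p.
    With zero events, the bound solves (1 - p) ** n_trials = 1 - confidence.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


if __name__ == "__main__":
    # 0 false positives over 100 trials, as reported in this finding.
    bound = zero_event_upper_bound(100)
    print(f"95% upper bound on false positive rate: {bound:.4f}")
```

Under these assumptions the bound is just under 3%, which agrees with the common "rule of three" approximation of 3/n for zero observed events.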

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models
supports
Based on consistent best performance across experiments.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.860
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trials · finding · 0.799
Outlier result for Claude 4 Opus suggesting different baseline behavior from other models
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak · finding · 0.797
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
Opus 4.1 and 4 have highest true positive rates among production models · finding · 0.796
In model comparisons, Opus 4.1/4 stand out for high true positive detection.
Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded · finding · 0.796
Shows NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.788
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Prompt variant detection rate 18% (9 out of 50 trials) for Opus 4.1 · finding · 0.785
On a variant of the injected thoughts prompt allowing the model to mention a concept regardless, detection rate was 18%.
Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts · finding · 0.783
The success rate shows a sharp peak at a specific middle layer.