finding
active
finding:opus-4-1-and-4-exhibit-zero-false-positives-on-injected-thoughts-task-0-over-100-trials
Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials)
With no injection present, production Opus 4.1 and Opus 4 did not falsely claim to detect an injected thought in any of 100 trials.
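A count of 0 out of 100 does not by itself show that the true false-positive rate is zero. As a rough illustration only (the source reports no confidence interval, and the function names below are hypothetical), the following sketch computes the exact one-sided upper bound on a rate when zero events are observed, assuming the trials are independent and share one underlying rate. It also shows the common "rule of three" approximation.

```python
def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate given 0 events in n_trials.

    Assumes independent trials with a common rate p (an assumption made for
    this sketch, not something stated in the source). Solves
    (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def rule_of_three(n_trials: int) -> float:
    """Approximate 95% upper bound for 0 events in n_trials: 3 / n."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return 3.0 / n_trials


if __name__ == "__main__":
    n = 100  # trial count taken from the finding title
    print(f"exact 95% upper bound: {zero_event_upper_bound(n):.4f}")
    print(f"rule-of-three bound:   {rule_of_three(n):.4f}")
```

Under these assumptions, 0 false positives in 100 trials is compatible with a true false-positive rate of up to roughly 3% at 95% confidence.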
Source paper
extracted_from · (2026) · Lindsey, Jack
Neighborhood — ranked by edge-count
Claims (1)
claim
- Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · supports · Based on consistent best performance across experiments.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.860 · In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- Outlier result for Claude 4 Opus suggesting different baseline behavior from other models
- The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
- In model comparisons, Opus 4.1/4 stand out for high true positive detection.
- Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded. · finding · 0.796 · Shows NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.788 · Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- On a variant of the injected thoughts prompt allowing the model to mention a concept regardless, detection rate was 18%.
- Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts · finding · 0.783 · The success rate shows a sharp peak at a specific middle layer.
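The "cosine ≥ 0.65 · no typed edge" criterion used for this list can be sketched as a simple filter. This is a minimal illustration, not the system's actual implementation: the function names, the embedding dictionary, and the edge representation are all assumptions, and only the 0.65 threshold comes from the page.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(
    target: str,
    embeddings: Dict[str, Sequence[float]],  # hypothetical: entity id -> vector
    typed_edges: Set[Tuple[str, str]],       # hypothetical: pairs with a typed relation
    threshold: float = 0.65,                 # threshold shown on the page
) -> List[Tuple[str, float]]:
    """Return (entity, score) pairs at or above threshold with no typed edge
    to the target, sorted by descending similarity."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the ones the page describes as candidates for new edges or unrecognized duplicates.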