finding

active

finding:opus-4-1-and-4-have-highest-true-positive-rates-among-production-models

Opus 4.1 and 4 have highest true positive rates among production models

In the cross-model comparison, Opus 4.1 and 4 achieve the highest true positive detection rates among the production models tested.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
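The selection rule described above (cosine similarity at or above 0.65, with no typed edge to this entity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the graph's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed
    edge to `target`, sorted from most to least similar."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already connected to the target by a typed relation.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: entity names and embeddings are made up.
embeddings = {
    "finding:a": [1.0, 0.0],
    "finding:b": [0.8, 0.6],  # similar to finding:a, no typed edge
    "claim:c": [0.0, 1.0],    # orthogonal to finding:a, below threshold
    "paper:d": [1.0, 0.0],    # identical direction, but has a typed edge
}
typed_edges = {frozenset(("finding:a", "paper:d"))}

related = related_by_similarity("finding:a", embeddings, typed_edges)
```

Here only `finding:b` survives both filters: `claim:c` falls below the threshold, and `paper:d` is excluded because it already has a typed relation to the target.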

Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.796
Production Opus 4.1/4 never falsely claim an injected thought when none is present.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.781
Key finding about the relationship between capability and introspection.
Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · claim · 0.780
Based on consistent best performance across experiments.
Prompt variant detection rate 18% (9 out of 50 trials) for Opus 4.1 · finding · 0.768
On a variant of the injected thoughts prompt allowing the model to mention a concept regardless, detection rate was 18%.
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.764
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trials · finding · 0.760
Outlier result for Claude 4 Opus suggesting different baseline behavior from other models.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.758
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.750
Quantifies harness adherence failure gap between strong and weak tier models.