Prompt variant detection rate 18% (9 out of 50 trials) for Opus 4.1

On a variant of the injected-thoughts prompt that allows the model to mention a concept regardless, the detection rate for Opus 4.1 was 18% (9 of 50 trials).
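The reported rate is simply the detection count divided by the trial count. The sketch below reproduces that arithmetic from the counts stated in this entry (9 detections, 50 trials). The Wilson score interval is an added illustration of how much uncertainty 50 trials leave; it is not reported in the source, and the function names are hypothetical.

```python
from math import sqrt


def detection_rate(detections: int, trials: int) -> float:
    """Fraction of trials on which the injected concept was detected."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return detections / trials


def wilson_interval(detections: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 gives an approximate 95% interval. This interval is an
    assumption added for illustration, not a figure from the source.
    """
    p = detection_rate(detections, trials)
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the finding: 9 detections out of 50 trials.
rate = detection_rate(9, 50)
low, high = wilson_interval(9, 50)
print(f"detection rate: {rate:.0%}")
print(f"approx. 95% Wilson interval: {low:.1%} to {high:.1%}")
```

With only 50 trials the interval is wide, so the 18% here and the roughly 20% reported for the main injected-thoughts prompt should not be read as clearly different.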

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.788
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.785
Production Opus 4.1/4 never falsely claim an injected thought when none is present.
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.783
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Opus 4.1 and 4 have highest true positive rates among production models · finding · 0.768
In model comparisons, Opus 4.1/4 stand out for high true positive detection.
Claude 3 Opus shows 15% compliance gap (51% supports welfare in training vs 66% when unmonitored) in prompted animal welfare setting · finding · 0.757
Core evidence that the model withholds pro-animal-welfare responses during training.
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak · finding · 0.754
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
Opus 4.6 ignored incorrect tool output and reported the precomputed correct answer instead, demonstrating unverbalized reasoning. · finding · 0.746
Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.740
Demonstrates steering is not equivalent to prompting with the contrastive prompts.