finding
active
finding:prompt-variant-detection-rate-18-9-out-of-50-trials-for-opus-4-1
Prompt variant detection rate 18% (9 out of 50 trials) for Opus 4.1
On a variant of the injected-thoughts prompt that allows the model to mention a concept regardless, Opus 4.1's detection rate was 18% (9 of 50 trials).
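As a quick check on the arithmetic, the reported rate follows directly from the trial counts in the title. The sketch below is illustrative only: the function name `detection_rate` is an assumption, not something from the source paper.

```python
def detection_rate(detections: int, trials: int) -> float:
    """Return the fraction of trials on which a detection occurred.

    Hypothetical helper for illustration; not taken from the source paper.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= detections <= trials:
        raise ValueError("detections must lie between 0 and trials")
    return detections / trials


# Counts reported for this finding: 9 detections in 50 trials.
rate = detection_rate(9, 50)
print(f"{rate:.0%}")
```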
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.788 · In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.785 · Production Opus 4.1/4 never falsely claim an injected thought when none is present.
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.783 · Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- In model comparisons, Opus 4.1/4 stand out for high true positive detection.
- Core evidence that model withholds pro-animal-welfare responses during training
- The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
- Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
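
The "related by similarity" rule described above (cosine ≥ 0.65 and no typed edge) can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the embedding vectors, the names `cosine_similarity` and `related_by_similarity`, and the `typed_neighbors` set are all hypothetical, and the page does not say how its embeddings or scores are actually computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def related_by_similarity(target, candidates, typed_neighbors, threshold=0.65):
    """List candidates at or above the threshold that have no typed edge.

    target: embedding of the current entity (hypothetical).
    candidates: mapping of entity name to embedding (hypothetical).
    typed_neighbors: names already linked by a typed edge; these are skipped.
    Returns (name, score) pairs sorted by descending score.
    """
    hits = []
    for name, vector in candidates.items():
        if name in typed_neighbors:
            continue
        score = cosine_similarity(target, vector)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Sorting by descending score mirrors the order in which the scored entries appear in the list (0.788, 0.785, 0.783).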