finding
active
finding:claude-opus-4-1-and-4-show-greatest-reduction-in-apology-rate-in-the-prefill-detection-task
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task
Injecting a concept that matches the prefilled word reduces the rate at which the model apologizes; the reduction is largest for the Opus models.
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · supports · Based on consistent best performance across experiments.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.831 · In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
- Suggests that later models can keep the thought 'silent' rather than letting it influence output.
- Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
- Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
- Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
- Explanation for the 'silent' thought phenomenon.
- Core empirical result for animal welfare setting; higher rate than helpful-only
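
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) could be computed roughly as in the sketch below. This is only an illustration under stated assumptions: the function names, the entity ids, and the toy two-dimensional vectors are all hypothetical, and nothing here is taken from the system that generated this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that are similar to `target` but share no typed edge.

    `embeddings` maps entity id -> vector; `typed_edges` is a set of (a, b) id pairs.
    Both structures are assumptions made for this sketch.
    """
    # Entities already connected to the target by any typed edge are excluded.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ranked list on the page.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data, invented purely for illustration.
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [0.9, 0.1],   # similar, no typed edge -> candidate
    "claim_c": [1.0, 0.05],    # similar, but already linked by a typed edge
    "finding_d": [0.0, 1.0],   # dissimilar
}
typed_edges = {("finding_a", "claim_c")}
candidates = related_by_similarity("finding_a", embeddings, typed_edges)
```

Entities returned this way are the ones a curator would review as possible new edges or unrecognized duplicates.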