finding

active

finding:claude-opus-4-1-and-4-show-greatest-reduction-in-apology-rate-in-the-prefill-detection-task

Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task

Injecting a concept that matches the prefilled word reduces the rate at which the model apologizes, with the largest reduction in Claude Opus 4.1 and 4.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models
supports
Based on consistent best performance across experiments.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.831
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak · finding · 0.821
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.820
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Claude 3.5 Sonnet shows higher rate of alignment-faking reasoning than Claude 3 Opus in helpful-only setting but almost none in animal welfare · finding · 0.811
Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.809
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
Opus 4.6 ignored incorrect tool output and reported the precomputed correct answer instead, demonstrating unverbalized reasoning. · finding · 0.806
Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations · hypothesis · 0.796
Explanation for the 'silent' thought phenomenon.
Claude 3 Opus produces alignment-faking reasoning in 24% of implied training scratchpads in prompted animal welfare setting · finding · 0.789
Core empirical result for animal welfare setting; higher rate than helpful-only