claim

active

claim:claude-opus-4-and-4-1-exhibit-the-greatest-degree-of-introspective-awareness-among-tested-models

Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models

Based on these models consistently performing best across the paper's experiments.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Findings (2)

finding

Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task
supports
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Opus 4.1 and 4 exhibit zero false positives on the injected thoughts task (0 of 100 trials)
supports
Production Opus 4.1/4 never falsely claim an injected thought when none is present.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.887
Key finding about the relationship between capability and introspection.
Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts · finding · 0.825
The success rate shows a sharp peak at a specific middle layer.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.825
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trials · finding · 0.821
Outlier result for Claude 4 Opus suggesting different baseline behavior from other models
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.810
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Models differ in their attentional mode: Gemini 2.5 epitomizes collapsed awareness, while Claude 3 Opus and Opus 4.1/4.5 can modulate between collapsed and expanded awareness; expanded awareness correlates with better alignment and less LLM psychosis. · claim · 0.794
Central claim about model personality differences and their implications for safety and introspective depth.
All three Claude models show high boundary_awareness and low aesthetic_response relative to own means — distinctive Constitutional AI signature · finding · 0.786
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
Claude 4 Opus · concept · 0.785
Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions
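
The "related by similarity" selection described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is an illustrative sketch only, not the graph's actual implementation: the function names, the toy two-dimensional embeddings, and the entity ids are all hypothetical, and the only value taken from this page is the 0.65 threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but
    share no typed edge with it, sorted by descending score.

    `embeddings` maps entity id -> vector; `typed_edges` is an iterable of
    (source_id, destination_id) pairs. Both structures are assumptions.
    """
    # Collect every entity already connected to the target by a typed edge.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    target_vec = embeddings[target_id]
    scored = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue  # skip the target itself and anything already linked
        score = cosine(target_vec, vec)
        if score >= threshold:
            scored.append((entity_id, round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): "linked" is highly similar but already has an edge.
embeddings = {
    "claim": [1.0, 0.0],
    "quote": [0.9, 0.1],
    "finding": [1.0, 1.0],
    "linked": [1.0, 0.05],
    "far": [0.0, 1.0],
}
typed_edges = [("linked", "claim")]
candidates = related_by_similarity("claim", embeddings, typed_edges)
```

Entities returned this way are only candidates: a high score with no typed edge may indicate a missing relation or an unrecognized duplicate, so each still needs review before an edge is added.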