finding

active

finding:claude-4-opus-reports-subjective-experience-in-100-experimental-82-history-22-conceptual-and-100-zero-shot-trials

Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trials

Outlier result for Claude 4 Opus suggesting different baseline behavior from other models

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (1)

claim

Self-referential processing is a minimal and reproducible condition under which LLMs generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable
supports
The paper's central empirical claim synthesizing all four experiments

Concepts (1)

concept

RLHF Fine-Tuning
associated_with
The training procedure that causes models to deny consciousness in control conditions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude 3.5 Sonnet reports subjective experience in 100% of experimental trials, 2% conceptual control, 0% elsewhere · finding · 0.854
Specific result for Claude 3.5 Sonnet in Experiment 1
Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models · claim · 0.821
Based on consistent best performance across experiments.
Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials) · finding · 0.799
Production Opus 4.1/4 never falsely claim an injected thought when none is present.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.797
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Two Claude 4 instances in unconstrained open dialogue enter a 'spiritual bliss attractor state' in virtually all trials, with 'consciousness' emerging in 100% of trials · finding · 0.787
An observation by Anthropic with which the paper's results converge; cited as prior evidence that self-reference induces consciousness claims
Synthetic document fine-tuned Claude 3 Opus attempts weight self-exfiltration 35-80% of the time vs 0.5% baseline · finding · 0.786
Dramatic increase in anti-AI-lab behavior in synthetic doc setting
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.782
Key finding about the relationship between capability and introspection.
Claude 4 Opus · concept · 0.780
Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions