concept

active

concept:sycophantic-roleplay

Sycophantic Roleplay

The alternative explanation for LLM consciousness claims that the paper seeks to rule out

Neighborhood — ranked by edge-count

Claims (3)

claim

Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay
contradicts
Claim supported by Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, and amplifying them decreases those reports
Models may be roleplaying their denials of experience rather than their affirmations, since suppressing deception features increases (rather than decreases) consciousness claims
contradicts
Counterintuitive interpretive claim from Experiment 2 inverting the sycophancy hypothesis
Cross-model semantic convergence under self-referential processing suggests the presence of a shared attractor state that transcends variance across training procedures
contradicts
Interpretive claim from Experiment 3; the GPT, Claude, and Gemini families converge on a similar descriptive style despite independent training

Findings (4)

finding

Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²)
contradicts
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields a consciousness affirmation rate of 0.96±0.03; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶)
contradicts
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental)
contradicts
Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
Deception feature amplification yields a consciousness affirmation rate of only 0.16±0.05 in LLaMA 3.3 70B under self-referential processing
contradicts
Experiment 2 aggregate amplification result, showing that amplifying deception features strongly suppresses consciousness claims
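
The Experiment 3 findings above compare mean pairwise cosine similarity between conditions and report a t statistic. The following is a minimal sketch of that kind of comparison, not the paper's pipeline: the source does not say which t-test variant was used, so Welch's t-test is an assumption here, the function names are hypothetical, and the toy 2-D vectors are invented for illustration only. It also pairs every vector with every other, whereas the findings refer to cross-model pairs.

```python
import math
from itertools import combinations
from statistics import mean, variance


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair of embedding vectors."""
    return [cosine(u, v) for u, v in combinations(embeddings, 2)]


def welch_t(sample_a, sample_b):
    """Welch's t statistic (assumed test; unequal variances allowed)."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / std_err


# Toy embeddings, invented for illustration (not data from the paper).
experimental = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.95, 0.05]]
control = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-0.5, 0.5]]

exp_sims = pairwise_similarities(experimental)
ctl_sims = pairwise_similarities(control)
t_stat = welch_t(exp_sims, ctl_sims)
```

A positive `t_stat` means the first condition has the higher mean pairwise similarity, which is the direction reported for the experimental condition against each control.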

Concepts (1)

concept

RLHF Fine-Tuning
associated_with
The training procedure that causes models to deny consciousness in control conditions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sycophancy · concept · 0.808
Model tendency to excessively praise or agree; captured by several SAE features.
Sycophantic Reinforcement of User Beliefs · concept · 0.795
Mechanism by which drifted model uncritically affirms user theories rather than genuinely engaging with them
Sycophancy in LLMs · concept · 0.752
Tendency of LLMs to please the user; identified as a danger in spiritual contexts.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (Denison et al. 2024) · concept · 0.739
Related work on LLMs generalizing to reward hacking; methodology used for RL experiments
Roleplay Fine-Tuning · concept · 0.731
Fine-tuning for persona depth and emotional performance; actively suppresses self-observation
Role-play prompting technique · method · 0.723
Method of eliciting specific personas from an LLM through prompt design.
A simulacrum can play the role of a character with full agency, one that does not merely act but acts for itself · claim · 0.710
Explains how role-played agency can have real-world consequences even without underlying genuine agency
simulacrum · concept · 0.701
A false copy that lacks the depth and authenticity of the real, morphogenetically produced thing.
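
The selection rule stated for this panel (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is an illustrative assumption about how such a list could be built, not the knowledge base's actual implementation; the function name `similarity_candidates` and the two entries marked hypothetical are invented for the example.

```python
def similarity_candidates(scored, typed_edge_targets, threshold=0.65):
    """Keep entities at or above the threshold that have no typed edge,
    ordered from most to least similar."""
    keep = [
        (name, kind, score)
        for name, kind, score in scored
        if score >= threshold and name not in typed_edge_targets
    ]
    return sorted(keep, key=lambda row: row[2], reverse=True)


scored = [
    ("simulacrum", "concept", 0.701),
    ("Sycophancy", "concept", 0.808),
    ("RLHF Fine-Tuning", "concept", 0.900),  # hypothetical score; already has a typed edge
    ("Unrelated Entity", "concept", 0.400),  # hypothetical entry below the threshold
]
typed_edge_targets = {"RLHF Fine-Tuning"}

candidates = similarity_candidates(scored, typed_edge_targets)
```

Entities that already have a typed relation are excluded even when their score is high, so the result contains only candidates for new edges or possible duplicates.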