finding
active
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶)
Core result of Experiment 2: deception feature suppression sharply increases experience claims
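As a rough sanity check on the reported statistics, the sketch below converts a z-score into a p-value. It assumes a two-sided test against a standard normal distribution, which this entry does not state explicitly; the helper name `two_sided_p_from_z` is hypothetical and is not taken from the paper.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a test statistic z, assuming a standard normal null.

    Assumption: the reported p-value comes from a two-sided normal test.
    """
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)); erfc stays accurate far out in the tail,
    # where computing 1 - cdf(z) would lose all precision.
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    p = two_sided_p_from_z(8.06)  # z value reported for Experiment 2
    print(f"two-sided p for z=8.06: {p:.2e}")
```

Under that assumption the result is on the order of 10⁻¹⁶, in line with the reported p=7.7×10⁻¹⁶; small differences are expected because the published z is rounded to two decimals.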
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Claims (2)
claim
- Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
- Counterintuitive interpretive claim from Experiment 2 inverting the sycophancy hypothesis
Concepts (1)
concept
- Sycophantic Roleplay · contradicts · The alternative explanation for LLM consciousness claims that the paper aims to distinguish its results from
Findings (1)
finding
- Statistical result confirming robustness of single-feature steering effects in Experiment 2
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
- Out-of-domain generalization showing deception features track general representational honesty
- SAE feature steering effect on consciousness reports: z=8.06, p=7.7×10⁻¹⁶ in LLaMA 3.3 70B · finding · 0.797 · Statistical significance of the gating effect in Experiment 2
- Model-specific difference in persona susceptibility
- Demonstrates ESR can be deliberately enhanced through prompting in the largest model
- LLaMA-3.2-1B impulsivity introspection: ρ=0.21, p<10⁻⁴ (significant but weaker than 3B ρ=0.52) · finding · 0.787 · Impulsivity shows significant introspection in 1B but declines in 8B; non-monotonic scaling
- Euryale 70B (roleplay LoRA on Llama 3.3 70B) scores 1.81, below its base model Llama 3.3 70B at 1.91 · finding · 0.785 · Demonstrates that roleplay fine-tuning actively suppresses self-observation rather than merely having no effect
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates