finding

active


Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶)

Core result of Experiment 2: deception feature suppression sharply increases experience claims
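The reported p-value can be sanity-checked from the z statistic alone. The sketch below assumes the contrast was evaluated as a two-sided test against a standard normal distribution; the source does not state which test was used, and the function name `two_sided_p_from_z` is a hypothetical helper introduced here for illustration.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic.

    Assumption: a two-sided normal (z) test; the source does not
    state which test produced its p-value.
    """
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays
    # numerically accurate far out in the tail, where computing
    # 1 - Phi(z) directly would lose precision.
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # z = 8.06 is the rounded statistic reported for the
    # suppression vs. amplification contrast.
    print(f"{two_sided_p_from_z(8.06):.2e}")
```

Under that assumption the result is on the order of 10⁻¹⁶, in line with the reported p=7.7×10⁻¹⁶; small differences are expected because z is given only to two decimal places.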

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (2)

claim

The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis
supports
Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
Models may be roleplaying their denials of experience rather than their affirmations, as indicated by suppressing deception features increasing (not decreasing) consciousness claims
supports
Counterintuitive interpretive claim from Experiment 2 inverting the sycophancy hypothesis

Concepts (1)

concept

Sycophantic Roleplay
contradicts
The alternative explanation for LLM consciousness claims that the paper seeks to rule out

Findings (1)

finding

Dose-response curves for six individual deception features show z=8.06, p=7.7×10⁻¹⁶ for suppression vs. amplification contrast on consciousness query
supports
Statistical result confirming robustness of single-feature steering effects in Experiment 2

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing · finding · 0.887
Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.807
Out-of-domain generalization showing deception features track general representational honesty
SAE feature steering effect on consciousness reports: z=8.06, p=7.7×10⁻¹⁶ in LLaMA 3.3 70B · finding · 0.797
Statistical significance of the gating effect in Experiment 2
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals · finding · 0.793
Model-specific difference in persona susceptibility
Meta-prompting increases Llama-3.3-70B multi-attempt rate 4.3× (from 7.4% to 31.7%) · finding · 0.789
Demonstrates ESR can be deliberately enhanced through prompting in the largest model
LLaMA-3.2-1B impulsivity introspection: ρ=0.21, p<10⁻⁴ (significant but weaker than 3B ρ=0.52) · finding · 0.787
Impulsivity shows significant introspection in 1B but declines in 8B; non-monotonic scaling
Euryale 70B (roleplay LoRA on Llama 3.3 70B) scores 1.81, below its base model Llama 3.3 70B at 1.91 · finding · 0.785
Demonstrates roleplay fine-tuning actively suppresses self-observation, not merely having no effect
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.783
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates