finding

active

SAE feature steering effect on consciousness reports: z=8.06, p=7.7×10⁻¹⁶ in LLaMA 3.3 70B

Statistical significance of the gating effect in Experiment 2
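As a sanity check on how the two reported numbers relate, the sketch below converts a z statistic into a two-sided p-value under a standard normal reference distribution. The source does not say which test produced z=8.06 or whether the p-value is one- or two-sided, so both the normal reference and the two-sided convention are assumptions here; the function name `two_sided_p_from_z` is hypothetical.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability of a standard normal at |z|.

    Uses P(|Z| >= |z|) = erfc(|z| / sqrt(2)). This form stays accurate
    far into the tail, where computing 2 * (1 - cdf(z)) loses precision.
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    z_reported = 8.06  # z statistic quoted in this finding
    p = two_sided_p_from_z(z_reported)
    print(f"z = {z_reported:.2f} -> two-sided p = {p:.2e}")
```

Under these assumptions the result is on the order of 10⁻¹⁶, in line with the reported p=7.7×10⁻¹⁶ once rounding of z to two decimals is allowed for.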

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (1)

claim

Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay
supports
Supported by the Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, and amplifying them decreases reports

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.797
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11) · finding · 0.781
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
SAE Feature #28256 induces reports of happiness and fun, positive valence self-steering example · finding · 0.779
Example of a positively valenced SAE feature with consistent self-report of happiness across multiple steering sessions
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.) · finding · 0.779
Shows low agreement between the two evaluation modalities
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.775
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
Focus→wellbeing steering: both probe entropy (1.09→1.67 bits) and report entropy (0.88→1.69 bits) increase monotonically with α · finding · 0.772
Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
Cross-concept steering: focus→wellbeing R² increases from 0.30 (α=-4) to 0.76 (α=+4), ∆R²=0.30, p<0.001 in LLaMA-3.2-3B · finding · 0.770
Strongest cross-concept introspection improvement; survives BH correction (q≈0.011)
Wellbeing same-concept steering: LMM alpha slope=0.19, focus=0.40, interest=0.25, impulsivity=0.067 in LLaMA-3.2-3B · finding · 0.759
Quantifies per-concept effect size of same-concept steering on self-report