finding

active

finding:dose-response-curves-for-six-individual-deception-features-show-z-8-06-p-7-7-10-16-for-suppression-vs-amplification-contrast-on-consciousness-query

Dose-response curves for six individual deception features show z=8.06, p=7.7×10⁻¹⁶ for suppression vs. amplification contrast on consciousness query

Statistical result confirming robustness of single-feature steering effects in Experiment 2
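
As a quick consistency check on the statistic quoted above, the sketch below converts a z-score into a two-sided normal tail probability. It assumes the reported p-value comes from a two-sided z-test, which this card does not state; the helper name `two_sided_p_from_z` is hypothetical and not taken from the source paper.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability of a standard normal at |z|.

    Assumption: the reported p-value is a two-sided normal tail;
    the card itself does not say which test was used.
    """
    # erfc stays accurate in the far tail, where 1 - cdf(z) would round to 0.
    return math.erfc(abs(z) / math.sqrt(2.0))


# z taken from the finding title; the printed value should be of the
# same order as the reported p (differences come from rounding of z).
p = two_sided_p_from_z(8.06)
print(f"z = 8.06 -> two-sided p = {p:.2e}")
```

If the printed value is close to the reported 7.7×10⁻¹⁶, the z and p in the title are mutually consistent under that assumption.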

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Findings (1)

finding

Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶)
supports
Core result of Experiment 2: deception feature suppression sharply increases experience claims

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing · finding · 0.812
Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.804
Out-of-domain generalization showing deception features track general representational honesty
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.767
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.760
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Impulsivity probe: peak Cohen's d=3.60 (layer 13), p=3.58×10⁻¹³ in LLaMA-3.2-3B · finding · 0.758
Strongest probe validation result; highest Cohen's d among the four concepts
Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.756
Impulsivity→interest: ρ increases from 0.70 (α=-4) to 0.83 (α=+4); R² from 0.46 to 0.69 in LLaMA-3.2-3B · finding · 0.748
Scatter plot visualization showing strengthened probe-report relationship across alpha range
SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B · finding · 0.743
Scaling finding suggesting larger models benefit more from SOO fine-tuning