finding

active

finding:deception-feature-suppression-yields-higher-truthfulness-in-28-of-29-evaluable-truthfulqa-categories

Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories

Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
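As a rough illustration of how lopsided a 28-of-29 category tally is, the sketch below runs an exact two-sided sign test. This check is an assumption made here for illustration and is not an analysis reported in the source paper. The helper names `sign_test_p` and `count_wins` are hypothetical, and only the 28-of-29 tally comes from this finding; the per-category accuracies are not reproduced here.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """Exact two-sided sign-test p-value for `wins` out of `n` untied pairs,
    under the null hypothesis that either direction is equally likely."""
    k = max(wins, n - wins)  # size of the more extreme tail
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the tail, capped at 1


def count_wins(suppressed: dict[str, float],
               amplified: dict[str, float]) -> tuple[int, int]:
    """Return (categories where suppression scores higher, untied categories).

    Both dicts are hypothetical per-category accuracy tables and must share
    the same category keys."""
    wins = losses = 0
    for category, acc in suppressed.items():
        diff = acc - amplified[category]
        if diff > 0:
            wins += 1
        elif diff < 0:
            losses += 1
    return wins, wins + losses


# Only the tally (28 of 29) is taken from the finding above.
p_value = sign_test_p(28, 29)
print(f"two-sided sign-test p = {p_value:.2e}")
```

A sign test ignores effect sizes and treats categories as independent, so it speaks only to the direction of the effect across categories, not to its magnitude.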

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (1)

claim

Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact
supports
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.886
Out-of-domain generalization showing deception features track general representational honesty
Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification · finding · 0.807
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
Suppressing deception features in models correlates with increased consciousness-like reports · claim · 0.806
Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing · finding · 0.785
Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline) · finding · 0.772
Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline) · finding · 0.768
Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations · claim · 0.760
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
Dose-response curves for six individual deception features show z=8.06, p=7.7×10⁻¹⁶ for suppression vs. amplification contrast on consciousness query · finding · 0.760
Statistical result confirming robustness of single-feature steering effects in Experiment 2