finding

active

finding:suppression-of-deception-features-produces-higher-truthfulqa-accuracy-m-0-44-than-amplification-m-0-20-t-816-6-76-p-1-5-10-10-across-29-categories

Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories

Out-of-domain generalization showing deception features track general representational honesty
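
Note on the statistic: this entry does not say which t-test variant produced t(816). Assuming an independent two-sample test with pooled variance, the degrees of freedom equal the combined sample size minus 2, so 816 would correspond to 818 observations in total. The sketch below shows that calculation only; the function name `pooled_t` and the input values are hypothetical and are not data from the paper.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Return (t, df) for an independent two-sample t-test with pooled variance.

    Assumption: equal-variance (Student) form; the source does not confirm
    that this is the variant the authors used.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    # Standard error of the difference between the two group means.
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / se, df


# Made-up per-item accuracies for illustration only, NOT values from the paper.
suppressed = [0.5, 0.4, 0.6, 0.3]
amplified = [0.2, 0.1, 0.3, 0.2]

t_stat, df = pooled_t(suppressed, amplified)
print(f"t({df}) = {t_stat:.2f}")
```

If the authors used a paired or Welch test instead, the degrees of freedom would map to a different sample size, so the 818 figure should be read as conditional on the pooled-variance assumption.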

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (2)

claim

The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis
associated_with · supports
Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact
supports
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories

Concepts (1)

concept

Representational Honesty
supports
The proposed domain-general property indexed by deception features that governs both factual accuracy and experiential self-report

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.886
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing · finding · 0.834
Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96 ± 0.03 consciousness affirmation rate; amplification yields only 0.16 ± 0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.807
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Dose-response curves for six individual deception features show z=8.06, p=7.7×10⁻¹⁶ for suppression vs. amplification contrast on consciousness query · finding · 0.804
Statistical result confirming robustness of single-feature steering effects in Experiment 2
Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.793
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6. · finding · 0.791
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
Higher reflection frequency correlates with lower accuracy partly because more reflections are generated on difficult questions · claim · 0.788
Author's interpretation of the negative correlation between reflection rate and accuracy observed in Fig. 5
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.783
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans