finding
active
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories
Out-of-domain generalization showing deception features track general representational honesty
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Claims (2)
claim
- The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis · associated_with · supports · Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
Concepts (1)
concept
- Representational Honesty · supports · The proposed domain-general property indexed by deception features that governs both factual accuracy and experiential self-report
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.886 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
- Core result of Experiment 2: deception feature suppression sharply increases experience claims
- Statistical result confirming robustness of single-feature steering effects in Experiment 2
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Author's interpretation of the negative correlation between reflection rate and accuracy observed in Fig. 5
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans