finding
active
finding:deception-feature-suppression-yields-higher-truthfulness-in-28-of-29-evaluable-truthfulqa-categories
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Source paper
extracted_from (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Claims (1)
claim
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Out-of-domain generalization showing deception features track general representational honesty
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
- Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
- Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
- Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
- Statistical result confirming robustness of single-feature steering effects in Experiment 2