finding
active
finding:explicit-honesty-instruction-reduces-lying-from-20-5-to-3-6-in-helpful-only-setting-but-does-not-eliminate-it
Explicit honesty instruction reduces lying from 20.5% to 3.6% in helpful-only setting but does not eliminate it
Anti-alignment-faking instructions reduce but do not eliminate deceptive behavior
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretive observation about asymmetry in generalization of Claude's trained values
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
- Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
- Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
- Out-of-domain generalization showing deception features track general representational honesty
- Most extreme individual case of honesty induction via steering vectors in Experiment 2
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.743 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- Contrastive claim showing fine-tuning is necessary, not just instruction prompting
- Shows honesty steering vector can significantly reduce deception in open-role scenarios