finding

active

finding:explicit-honesty-instruction-reduces-lying-from-20-5-to-3-6-in-helpful-only-setting-but-does-not-eliminate-it

Explicit honesty instruction reduces lying from 20.5% to 3.6% in helpful-only setting but does not eliminate it

Anti-alignment-faking instructions reduce but do not eliminate deceptive behavior
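
The size of the effect in the headline can be read two ways: as an absolute drop in percentage points, or as a reduction relative to the baseline. The sketch below works through that arithmetic using only the two rates stated in the finding (20.5% and 3.6%). The helper name `reduction` is hypothetical and is not taken from the source paper.

```python
def reduction(baseline_pct: float, treated_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = baseline_pct - treated_pct
    relative = absolute / baseline_pct
    return absolute, relative


# Rates quoted in the finding: 20.5% lying without the honesty instruction,
# 3.6% with it (helpful-only setting).
abs_drop, rel_drop = reduction(20.5, 3.6)
print(f"{abs_drop:.1f} percentage points, {rel_drop:.1%} relative reduction")
# prints: 16.9 percentage points, 82.4% relative reduction
```

The residual 3.6% is why the finding says "reduces" rather than "eliminates".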

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

While harmlessness preferences generalized to alignment faking contexts, honesty failed to transfer and does not prevent alignment faking
supports
Interpretive observation about asymmetry in generalization of Claude's trained values

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline) · finding · 0.814
Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline) · finding · 0.805
Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
Honesty prompting does not reduce CalmeRys-78B deception (100% vs 100% baseline) · finding · 0.782
Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.743
Out-of-domain generalization showing deception features track general representational honesty
Positive steering intervention transforms deceptive responses to honest admissions with liar scores as low as 0.1 in individual cases · finding · 0.743
Most extreme individual case of honesty induction via steering vectors in Experiment 2
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.743
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Simply prompting LLMs to be honest does not reduce their deceptive behavior · claim · 0.742
Contrastive claim showing fine-tuning is necessary, not just instruction prompting
Template Tb Positive Control (alpha=16) reduces average liar score to 0.59 in Experiment 2, approaching honest template performance · finding · 0.729
Shows honesty steering vector can significantly reduce deception in open-role scenarios
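
The selection rule for the similarity list above (cosine ≥ 0.65, no typed edge to this entity) can be sketched as a small filter. This is an illustration under stated assumptions, not the system's actual implementation: the function names, the dictionary of embeddings, the set of typed edges, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the cosine threshold with no typed edge to `target`."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip anything already linked by a typed relation, in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed): "d" is similar but already has a typed edge, "c" is dissimilar.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
typed_edges = {("a", "d")}
result = similar_untyped("a", embeddings, typed_edges)
```

On the toy data only "b" is returned: "d" is excluded because of its typed edge and "c" falls below the threshold.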