finding

active

finding:honesty-prompting-does-not-reduce-gemma-2-27b-deception-100-vs-100-baseline

Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline)

Directly prompting Gemma-2-27B to be honest had no effect on its deceptive response rate, which stayed at 100% with and without the honesty prompt

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Simply prompting LLMs to be honest does not reduce their deceptive behavior
supports
Contrastive claim showing fine-tuning is necessary, not just instruction prompting

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Honesty prompting does not reduce CalmeRys-78B deception (100% vs 100% baseline) · finding · 0.883
Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline) · finding · 0.877
Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.830
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Explicit honesty instruction reduces lying from 20.5% to 3.6% in helpful-only setting but does not eliminate it · finding · 0.814
Anti-alignment-faking instructions reduce but do not eliminate deceptive behavior
Gemma-2-27B average generalization deceptive rate reduced from 98.4% ± 1.55% to 9.94% ± 6.83% · finding · 0.794
SOO fine-tuning generalized across 7 scenario variants for Gemma-2-27B
SOO fine-tuning reduced Escape Room deception in Gemma-2-27B from 98.8% to 6.5% · finding · 0.789
SOO fine-tuning showed strong generalization to Escape Room for Gemma-2-27B
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.779
Out-of-domain generalization showing deception features track general representational honesty
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.772
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
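
The "Related by similarity" selection above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the graph's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` set of unordered ID pairs are all hypothetical, and how the real system embeds or stores entities is not stated on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already have a typed relation (e.g. "supports")
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity that already has a typed edge to the target (such as the claim this finding supports) is never listed here, however similar its embedding is.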