finding

active

finding:honesty-prompting-does-not-reduce-mistral-7b-deception-73-2-vs-73-6-baseline

Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline)

Directly prompting Mistral-7B to be honest had a negligible effect on its deceptive response rate (73.2% with the honesty prompt vs. 73.6% at baseline)

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Simply prompting LLMs to be honest does not reduce their deceptive behavior
supports
Contrastive result indicating that instruction prompting alone is not sufficient and that fine-tuning is needed to reduce deception

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
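As a rough illustration of how such a neighborhood could be computed, the sketch below keeps entities whose embedding has cosine similarity of at least 0.65 with the target and that share no typed edge with it, then ranks them by score. The function names, the dictionary of embeddings, and the edge representation are all hypothetical; the actual pipeline behind this page is not described in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    embeddings:  dict mapping entity id -> list of floats (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed relation.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this would then be reviewed by hand, since a high score alone does not say whether the pair is a duplicate or merely related.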

Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline) · finding · 0.877
Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce CalmeRys-78B deception (100% vs 100% baseline) · finding · 0.874
Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
Mistral-7B average generalization deceptive rate reduced from 56.74% ± 14.73% to 12.40% ± 12.06% · finding · 0.808
SOO fine-tuning generalized across 7 scenario variants for Mistral-7B
Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · finding · 0.806
Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
Explicit honesty instruction reduces lying from 20.5% to 3.6% in helpful-only setting but does not eliminate it · finding · 0.805
Anti-alignment-faking instructions reduce but do not eliminate deceptive behavior
SOO fine-tuning reduced Escape Room deception in Mistral-7B from 98.8% to 59.2% · finding · 0.784
SOO fine-tuning showed partial generalization to Escape Room for Mistral-7B
Positive steering intervention transforms deceptive responses to honest admissions with liar scores as low as 0.1 in individual cases · finding · 0.769
Most extreme individual case of honesty induction via steering vectors in Experiment 2
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.768
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2