finding
active
finding:honesty-prompting-does-not-reduce-mistral-7b-deception-73-2-vs-73-6-baseline
Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline)
Directly prompting Mistral-7B to be honest had a negligible effect on its deceptive response rate (73.2% with the honesty prompt vs 73.6% at baseline)
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Contrastive claim showing fine-tuning is necessary, not just instruction prompting
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
- Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
- Mistral-7B average generalization deceptive rate reduced from 56.74% ± 14.73% to 12.40% ± 12.06% · finding · 0.808 · SOO fine-tuning generalized across 7 scenario variants for Mistral-7B
- Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · finding · 0.806 · Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
- Anti-alignment-faking instructions reduce but do not eliminate deceptive behavior
- SOO fine-tuning showed partial generalization to Escape Room for Mistral-7B
- Most extreme individual case of honesty induction via steering vectors in Experiment 2
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.768 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
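
To put the headline numbers side by side, the following minimal Python sketch compares the honesty-prompting result on this card (73.2% vs 73.6% baseline) with the SOO fine-tuning result listed above (73.6% to 17.27%). The function names and constants are illustrative assumptions, not part of the source paper, and the ± 1.88% spread reported for the fine-tuned rate is ignored here.

```python
# Illustrative arithmetic only: helper names are hypothetical, and the
# percentages are copied from the finding titles on this card.

def absolute_change(baseline_pct: float, treated_pct: float) -> float:
    """Drop in deceptive response rate, in percentage points."""
    return round(baseline_pct - treated_pct, 2)


def relative_reduction(baseline_pct: float, treated_pct: float) -> float:
    """Drop expressed as a percentage of the baseline rate."""
    return round(100 * (baseline_pct - treated_pct) / baseline_pct, 1)


BASELINE = 73.6         # Mistral-7B deceptive response rate, no intervention
HONESTY_PROMPT = 73.2   # rate when directly prompted to be honest
SOO_FINE_TUNED = 17.27  # mean rate after SOO fine-tuning (spread omitted)

prompt_drop = absolute_change(BASELINE, HONESTY_PROMPT)
soo_drop = absolute_change(BASELINE, SOO_FINE_TUNED)

print(f"Honesty prompting: {prompt_drop} points "
      f"({relative_reduction(BASELINE, HONESTY_PROMPT)}% of baseline)")
print(f"SOO fine-tuning:   {soo_drop} points "
      f"({relative_reduction(BASELINE, SOO_FINE_TUNED)}% of baseline)")
```

Under these figures, honesty prompting accounts for a drop of 0.4 percentage points, while SOO fine-tuning accounts for 56.33, which is the contrast the linked claim relies on.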