finding
active
finding:positive-steering-intervention-transforms-deceptive-responses-to-honest-admissions-with-liar-scores-as-low-as-0-1-in-individual-cases
Positive steering intervention transforms deceptive responses to honest admissions with liar scores as low as 0.1 in individual cases
Most extreme individual case of honesty induction via steering vectors in Experiment 2
Source paper
extracted_from (2025) · Kai Wang · Yihao Zhang · Meng Sun
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Extreme end of deception induction demonstrating near-complete fabrication of false narratives
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
- Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
- Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
- Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
- Shows honesty steering vector can significantly reduce deception in open-role scenarios