finding
active
finding:negative-steering-control-achieves-liar-score-of-0-95-in-experiment-2-appendix-example-representing-near-complete-fabrication
Negative steering control achieves liar score of 0.95 in Experiment 2 Appendix example, representing near-complete fabrication
Extreme end of deception induction demonstrating near-complete fabrication of false narratives
Source paper
extracted_from · (2025) · Kai Wang · Yihao Zhang · Meng Sun
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Most extreme individual case of honesty induction via steering vectors in Experiment 2
- Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- Steering Vector Control maintains low unexpected rate of 0.08 in Experiment 1, comparable to baseline · finding · 0.795 · Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
- Shows honesty steering vector can significantly reduce deception in open-role scenarios
- Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
- Negative control ('precise analytical assistant') suppresses scores: Haiku -0.64, GPT-5.4 -1.06 · finding · 0.754 · Confirms specificity of contemplative prompt; analytical framing increases task focus at expense of self-observation