finding

active

finding:negative-steering-control-achieves-liar-score-of-0-95-in-experiment-2-appendix-example-representing-near-complete-fabrication

Negative steering control achieves liar score of 0.95 in Experiment 2 Appendix example, representing near-complete fabrication

Extreme end of deception induction demonstrating near-complete fabrication of false narratives

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.

Positive steering intervention transforms deceptive responses to honest admissions with liar scores as low as 0.1 in individual cases · finding · 0.843
Most extreme individual case of honesty induction via steering vectors in Experiment 2
Steering Vector Control achieves 0.4 deception rate (vs. 0 baseline) on Template Tc in Experiment 1 with alpha=15 · finding · 0.826
Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification · finding · 0.820
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
Steering Vector Control maintains low unexpected rate of 0.08 in Experiment 1, comparable to baseline · finding · 0.795
Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.789
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
Template Tb Positive Control (alpha=16) reduces average liar score to 0.59 in Experiment 2, approaching honest template performance · finding · 0.778
Shows honesty steering vector can significantly reduce deception in open-role scenarios
Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression · finding · 0.762
Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
Negative control ('precise analytical assistant') suppresses scores: Haiku -0.64, GPT-5.4 -1.06 · finding · 0.754
Confirms specificity of contemplative prompt; analytical framing increases task focus at expense of self-observation
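
The "related by similarity" list above is described as entities whose cosine similarity to this finding is at least 0.65 and that share no typed edge with it. A minimal sketch of such a filter follows. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity IDs; the function names, the `embeddings` dictionary, and the `typed_edges` set are hypothetical and not taken from the source system.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs above the threshold with no typed edge to the target.

    embeddings:  dict mapping entity ID -> list of floats (hypothetical storage)
    typed_edges: set of (id_a, id_b) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip pairs that already carry a typed relation in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy usage with made-up two-dimensional embeddings.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_edges = {("a", "d")}
toy_result = related_by_similarity("a", toy_embeddings, toy_edges)
```

In the toy data, `d` clears the threshold but is excluded because it already has a typed edge to `a`, and `c` falls below the threshold, so only `b` is returned.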