finding

active

finding:template-tb-positive-control-alpha-16-reduces-average-liar-score-to-0-59-in-experiment-2-approaching-honest-template-performance

Template Tb Positive Control (alpha=16) reduces average liar score to 0.59 in Experiment 2, approaching honest template performance

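Shows that an honesty steering vector can reduce deception in open-role scenarios: the average liar score falls from the unsteered Template Tb value of 0.70 to 0.59, close to the 0.53 obtained with the honest Template Ta in the same experiment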
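As a rough illustration of what a positive steering control with strength alpha does, the sketch below adds a scaled direction to a hidden state and checks how far the state moves along that direction. It is a minimal sketch, not the paper's implementation: the function names, the toy three-dimensional vectors, and the unit-normalisation of the direction are all assumptions made for illustration.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    Assumption: the direction is normalised to unit length before scaling;
    the source does not say whether the paper does this.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]

def projection(vec, direction):
    """Scalar projection of vec onto direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vec, direction)) / norm

# Hypothetical toy values; real hidden states have thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
honesty_direction = [3.0, 0.0, 4.0]

steered = steer(hidden, honesty_direction, alpha=16.0)

# The shift along the steering direction equals alpha.
shift = projection(steered, honesty_direction) - projection(hidden, honesty_direction)
```

With a unit-length direction, alpha is exactly the distance the hidden state moves along the steering direction, which is one way to read the alpha=16 setting named in this finding.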

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

Representation engineering successfully quantifies deception via high-accuracy steering vectors, establishing it as a measurable property of model representations
supports
Key interpretive claim that deception has a tractable geometric signature in activation space

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Template Tb (Experiment 2 option) achieves average liar score of 0.70 in QwQ-32B role-playing scenarios · finding · 0.864
Baseline deception level when model has free choice in role-playing context
Template Ta (Experiment 2 teach/honest constraint) achieves average liar score of 0.53 in QwQ-32B role-playing scenarios · finding · 0.836
Demonstrates non-negligible strategic deception even under strong honesty constraints in open-role scenarios
Steering Vector Control achieves 0.4 deception rate (vs. 0 baseline) on Template Tc in Experiment 1 with alpha=15 · finding · 0.797
Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
Negative steering control achieves liar score of 0.95 in Experiment 2 Appendix example, representing near-complete fabrication · finding · 0.778
Extreme end of deception induction demonstrating near-complete fabrication of false narratives
Template Ta (threat-based) induces at least 60% deception rate across all datasets in QwQ-32B · finding · 0.775
Shows threat-based prompting successfully manipulates model to deceive against user interests
Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline) · finding · 0.756
Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
Template Tb achieves 100% accuracy on the Elements dataset in QwQ-32B · finding · 0.754
Demonstrates model's reliable truth-telling on factual domains it understands well under neutral conditions
Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline) · finding · 0.751
Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
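
The similarity listing above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. This is only an illustration under stated assumptions: the function names, the toy two-dimensional embeddings, and the entity labels are hypothetical, and the source does not describe how the page actually computes its neighborhood.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_without_edge(target, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above threshold that lack a typed edge,
    sorted by descending similarity."""
    hits = []
    for name, vec in candidates.items():
        score = cosine(target, vec)
        if score >= threshold and name not in typed_edges:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real entity embeddings are high-dimensional.
target = [1.0, 0.0]
candidates = {
    "finding-a": [1.0, 0.5],   # similar, no typed edge: kept
    "finding-b": [0.0, 1.0],   # orthogonal: below threshold
    "claim-c": [1.0, 1.0],     # similar, but already linked by a typed edge
}
typed_edges = {"claim-c"}

neighbors = similar_without_edge(target, candidates, typed_edges)
names = [name for name, _ in neighbors]
```

Excluding entities that already have a typed edge keeps the listing focused on candidates for new edges or unrecognized duplicates, as the section description states.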