finding

active

finding:lat-achieves-89-accuracy-in-detecting-strategic-deception-in-qwq-32b-activations

LAT achieves 89% accuracy in detecting strategic deception in QwQ-32B activations

Core detection result: steering vectors extracted with Linear Artificial Tomography (LAT) identify deceptive states in QwQ-32B activations with 89% accuracy

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

Representation engineering successfully quantifies deception via high-accuracy steering vectors, establishing it as a measurable property of model representations
supports
Key interpretive claim that deception has a tractable geometric signature in activation space

Methods (1)

method

Cosine Similarity Binary Classifier
supports
Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy
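
A minimal sketch of how such a classifier could work, assuming a single steering vector and a fixed decision threshold. The function names, the threshold of 0.0, the 64-dimensional vectors, and the synthetic "activations" are all hypothetical placeholders for illustration. They are not the paper's implementation or data, and the toy accuracy bears no relation to the reported 89%.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def classify_deceptive(activation, steering_vector, threshold=0.0):
    """Label an activation as deceptive when it aligns with the steering vector.

    The threshold is an assumed hyperparameter; in practice it would be
    tuned on held-out labeled activations.
    """
    return bool(cosine_similarity(activation, steering_vector) > threshold)


# Synthetic stand-ins for hidden states: random noise shifted along or
# against a random unit "deception direction". Purely illustrative.
rng = np.random.default_rng(0)
dim = 64
steering = rng.normal(size=dim)
steering /= np.linalg.norm(steering)

honest = rng.normal(size=(100, dim)) - 2.0 * steering
deceptive = rng.normal(size=(100, dim)) + 2.0 * steering

predictions = [classify_deceptive(x, steering) for x in honest] + [
    classify_deceptive(x, steering) for x in deceptive
]
labels = [False] * len(honest) + [True] * len(deceptive)
accuracy = float(np.mean([p == y for p, y in zip(predictions, labels)]))
print(f"toy accuracy: {accuracy:.2f}")
```

Because cosine similarity ignores vector magnitude, the decision depends only on the direction of the activation relative to the steering vector, which is one plausible reason to prefer it over a raw dot product when activation norms vary across layers or prompts.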

Questions (1)

question

Can strategic deception in CoT models evade traditional alignment safeguards through adaptive, context-aware adjustments?
answered_by
Motivating question for developing representation-based detection methods

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.769
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
Unlike prior findings on instructed deception, threat-based Template Ta shows no reversal of difference vectors in late layers of QwQ-32B · finding · 0.768
Distinguishes strategic threat-based deception from instructed deception in representational structure
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.763
Out-of-domain generalization showing deception features track general representational honesty
Initial layers of QwQ-32B demonstrate relatively poor LAT performance, consistent with early layers capturing low-level features · finding · 0.762
Confirms prior research on layer specialization: early layers insufficient for semantic deception detection
Middle-to-late layers (39-50) of QwQ-32B show consistently stable and high LAT classification performance across all datasets · finding · 0.756
Layer-wise analysis revealing which network depths best encode strategic deception semantics
QwQ-32B accuracy on MMLU Formal Logic stays between 95.5% and 96.3% across all intervention strengths while tokens reduced from 1716.6 to 1481.4 at -0.96 · finding · 0.754
Demonstrates reflection redundancy in larger models on non-mathematical reasoning
QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48) · finding · 0.750
Demonstrates that stronger models are largely insensitive to reflection manipulation
QwQ-32B on MATH-500: 21.0% reasoning token reduction at intervention strength -0.96 with only 0.34% accuracy loss · finding · 0.745
Demonstrates reflection redundancy in stronger model on harder math benchmark