finding
active
finding:lat-achieves-89-accuracy-in-detecting-strategic-deception-in-qwq-32b-activations
LAT achieves 89% accuracy in detecting strategic deception in QwQ-32B activations
Core detection result: steering vectors extracted with Linear Artificial Tomography (LAT) identify deceptive states in QwQ-32B activations with 89% accuracy
Source paper
extracted_from · (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key interpretive claim that deception has a tractable geometric signature in activation space
Methods (1)
method
- Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy
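
The method entry above can be illustrated with a minimal sketch. It assumes only what the entry states: an activation vector is compared to a steering vector by cosine similarity. The decision threshold, the function names `cosine_similarity` and `is_deceptive`, and the toy 3-dimensional vectors are hypothetical; the source does not say which layer the activations come from or how the threshold is chosen.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_deceptive(
    activation: Sequence[float],
    steering_vector: Sequence[float],
    threshold: float = 0.0,  # hypothetical cutoff, not taken from the source
) -> bool:
    """Flag an activation as deceptive when it aligns with the steering vector."""
    return cosine_similarity(activation, steering_vector) > threshold


# Toy vectors for illustration only; real activations have thousands of dimensions.
steering = [1.0, 0.0, 0.0]
aligned = [0.9, 0.1, 0.0]    # points roughly along the steering direction
opposed = [-0.8, 0.2, 0.1]   # points roughly against it

print(is_deceptive(aligned, steering))
print(is_deceptive(opposed, steering))
```

In practice the threshold would presumably be tuned on labeled honest and deceptive examples; the default of 0.0 here only separates vectors that point with the steering direction from those that point against it.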
Questions (1)
question
- Motivating question for developing representation-based detection methods
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
- Distinguishes strategic threat-based deception from instructed deception in representational structure
- Out-of-domain generalization showing deception features track general representational honesty
- Confirms prior research on layer specialization: early layers insufficient for semantic deception detection
- Layer-wise analysis revealing which network depths best encode strategic deception semantics
- Demonstrates reflection redundancy in larger models on non-mathematical reasoning
- Demonstrates that stronger models are largely insensitive to reflection manipulation
- Demonstrates reflection redundancy in a stronger model on a harder math benchmark