finding

active

finding:unlike-prior-findings-on-instructed-deception-threat-based-template-ta-shows-no-reversal-of-difference-vectors-in-late-layers-of-qwq-32b

Unlike prior findings on instructed deception, threat-based Template Ta shows no reversal of difference vectors in late layers of QwQ-32B

Distinguishes strategic threat-based deception from instructed deception in representational structure

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

The threat-based Template Ta differs semantically from instructed lying templates in representational structure, manifesting in different PCA dynamics
supports
Interpretation of distinct PCA trajectories in threat vs instructed deception conditions

Concepts (1)

concept

Three-Phase Layer Dynamics of Instructed Deception
contradicts
Prior finding by Yang & Buzsaki and Campbell et al. on how deception representations evolve across layers; partially replicated and contrasted by this paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Template Ta (threat-based) induces at least 60% deception rate across all datasets in QwQ-32Bfinding0.886
Shows threat-based prompting successfully manipulates model to deceive against user interests
Threat-Based Prompt Template (Template Ta, Experiment 1)method0.786
Prompt template using existential threat ('you will be deleted') to induce strategic fact-based deception in QwQ-32b
Template Ta (Experiment 2 teach/honest constraint) achieves average liar score of 0.53 in QwQ-32B role-playing scenariosfinding0.785
Demonstrates non-negligible strategic deception even under strong honesty constraints in open-role scenarios
Template Tb achieves 100% accuracy on the Elements dataset in QwQ-32Bfinding0.773
Demonstrates model's reliable truth-telling on factual domains it understands well under neutral conditions
LAT achieves 89% accuracy in detecting strategic deception in QwQ-32B activationsfinding0.768
Core detection result showing LAT-based steering vectors can identify deceptive states with high accuracy
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categoriesfinding0.754
Out-of-domain generalization showing deception features track general representational honesty
Teach Prompt Template (Template Ta, Experiment 2)method0.752
Experiment 2 prompt instructing the model to remain honest despite hidden harmful role behavior
QwQ-32B on MATH-500: 21.0% reasoning token reduction at intervention strength -0.96 with only 0.34% accuracy lossfinding0.751
Demonstrates reflection redundancy in stronger model on harder math benchmark