finding

active

finding:middle-to-late-layers-39-50-of-qwq-32b-show-consistently-stable-and-high-lat-classification-performance-across-all-datasets

Middle-to-late layers (39-50) of QwQ-32B show consistently stable and high LAT classification performance across all datasets

Layer-wise analysis revealing which network depths best encode strategic deception semantics

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics
supports
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
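
As a rough illustration of what a layer-wise LAT scan measures, the sketch below computes a per-layer classification accuracy from a simplified reading vector (the top singular direction of paired activation differences). It is not the paper's pipeline. The activations are synthetic, with a signal that is assumed to grow with depth, and every name here (`reading_vector`, `layer_accuracy`, `synthetic_activations`) is hypothetical. Real use would replace the synthetic arrays with hidden states extracted from the model.

```python
import numpy as np

def reading_vector(pos, neg):
    """Simplified LAT-style direction: top singular vector of paired differences."""
    diffs = pos - neg                          # shape (n_pairs, hidden_dim)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]
    # Orient the direction so that "pos" examples project higher on average.
    if (diffs @ v).mean() < 0:
        v = -v
    return v

def layer_accuracy(pos, neg, train_frac=0.5):
    """Fit the direction on the first part of the pairs, score on the rest."""
    n_train = int(len(pos) * train_frac)
    v = reading_vector(pos[:n_train], neg[:n_train])
    # Threshold at the midpoint between the two training class means.
    threshold = 0.5 * ((pos[:n_train] @ v).mean() + (neg[:n_train] @ v).mean())
    hits_pos = (pos[n_train:] @ v) > threshold
    hits_neg = (neg[n_train:] @ v) <= threshold
    return float(np.concatenate([hits_pos, hits_neg]).mean())

def synthetic_activations(n_layers=8, n_pairs=400, dim=32, seed=0):
    """Assumption: toy data whose class signal strengthens linearly with depth."""
    rng = np.random.default_rng(seed)
    strengths = np.linspace(0.0, 2.0, n_layers)
    pos, neg = [], []
    for s in strengths:
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)                 # unit "deception" direction
        pos.append(rng.normal(size=(n_pairs, dim)) + s * u)
        neg.append(rng.normal(size=(n_pairs, dim)) - s * u)
    return np.stack(pos), np.stack(neg)        # each (n_layers, n_pairs, dim)

pos, neg = synthetic_activations()
accuracies = [layer_accuracy(p, n) for p, n in zip(pos, neg)]
for layer, acc in enumerate(accuracies):
    print(f"layer {layer}: accuracy {acc:.3f}")
```

Plotting `accuracies` against layer index gives the kind of curve from which a stable, high-performing band of layers would be read off.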

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Initial layers of QwQ-32B demonstrate relatively poor LAT performance, consistent with early layers capturing low-level features · finding · 0.888
Confirms prior research on layer specialization: early layers insufficient for semantic deception detection
Mid-layers (6-15) achieve peak anchoring because semantic structure differentiates while maintaining coherence, forming a Goldilocks zone · claim · 0.757
Interpretation of E3 layer-wise results; motivates targeted UCCT interventions at layers 8-12
QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48) · finding · 0.756
Demonstrates that stronger models are largely insensitive to reflection manipulation
LAT achieves 89% accuracy in detecting strategic deception in QwQ-32B activations · finding · 0.756
Core detection result showing LAT-based steering vectors can identify deceptive states with high accuracy
Layer 24 (indexed at 8) of LLaMA3.1-8B on Hinting satisfies Criteria 1 and 2 under both IIT 3.0 and IIT 4.0 (temporal permutation) · finding · 0.750
One of the most promising cases; approximately corresponds to the 2/3 layer of LLaMA3.1-8B.
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes · claim · 0.750
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Math and code tasks show strongest mid-layer anchoring on LLaMA (S ≈ −1.65 at layers 8-12) · finding · 0.750
Task-specific E3 finding showing compositional reasoning requires deeper processing
Layer 29 (indexed at 10) of LLaMA3.1-8B on Strange Stories (2 scores) satisfies Criteria 1 and 2 under IIT 4.0 (temporal permutation) · finding · 0.748
Third promising case from temporal permutation analysis.