finding

active

finding:initial-layers-of-qwq-32b-demonstrate-relatively-poor-lat-performance-consistent-with-early-layers-capturing-low-level-features

Initial layers of QwQ-32B demonstrate relatively poor LAT performance, consistent with early layers capturing low-level features

Confirms prior research on layer specialization: early layers insufficient for semantic deception detection

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics
supports
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
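
As an illustration of what a layer-wise LAT scan involves, the sketch below fits one linear reading direction per layer and scores it on held-out pairs. It is not the paper's code. The activations are synthetic Gaussians whose signal strength is assumed to grow with depth, the layer count and dimensions are arbitrary, and all function names (`fit_lat_direction`, `lat_accuracy`, `make_layer`, `scan_layers`) are hypothetical. The direction is taken as the top singular vector of the uncentered paired differences, a simplification of the PCA step commonly used in LAT.

```python
import numpy as np


def fit_lat_direction(pos, neg):
    """Fit a reading direction and threshold from paired activations (n_pairs, dim)."""
    diffs = pos - neg
    # Top right-singular vector of the uncentered paired differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # Orient the direction so the positive class projects higher on average.
    if (pos @ direction).mean() < (neg @ direction).mean():
        direction = -direction
    # Threshold halfway between the two class means along the direction.
    threshold = 0.5 * ((pos @ direction).mean() + (neg @ direction).mean())
    return direction, threshold


def lat_accuracy(direction, threshold, pos, neg):
    """Fraction of held-out activations classified correctly by projection."""
    correct = (pos @ direction > threshold).sum() + (neg @ direction <= threshold).sum()
    return float(correct / (len(pos) + len(neg)))


def make_layer(rng, n_pairs, dim, strength, signal_dir):
    """Synthetic stand-in for one layer: unit Gaussian noise plus a class offset."""
    pos = rng.standard_normal((n_pairs, dim)) + 0.5 * strength * signal_dir
    neg = rng.standard_normal((n_pairs, dim)) - 0.5 * strength * signal_dir
    return pos, neg


def scan_layers(n_layers=8, n_pairs=300, dim=32, max_strength=4.0, seed=0):
    """Return held-out accuracy per layer; signal strength rises with depth (assumed)."""
    rng = np.random.default_rng(seed)
    signal_dir = rng.standard_normal(dim)
    signal_dir /= np.linalg.norm(signal_dir)
    accuracies = []
    for strength in np.linspace(0.0, max_strength, n_layers):
        train_pos, train_neg = make_layer(rng, n_pairs, dim, strength, signal_dir)
        test_pos, test_neg = make_layer(rng, n_pairs, dim, strength, signal_dir)
        direction, threshold = fit_lat_direction(train_pos, train_neg)
        accuracies.append(lat_accuracy(direction, threshold, test_pos, test_neg))
    return accuracies


if __name__ == "__main__":
    for layer, acc in enumerate(scan_layers()):
        print(f"layer {layer}: held-out accuracy {acc:.3f}")
```

With real data, `make_layer` would be replaced by hidden states extracted at each layer for contrasting honest and deceptive prompts. Layers where the direction barely beats chance would correspond to the weak early-layer results this finding describes.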

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Middle-to-late layers (39-50) of QwQ-32B show consistently stable and high LAT classification performance across all datasets · finding · 0.888
Layer-wise analysis revealing which network depths best encode strategic deception semantics
LAT achieves 89% accuracy in detecting strategic deception in QwQ-32B activations · finding · 0.762
Core detection result showing LAT-based steering vectors can identify deceptive states with high accuracy
QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48) · finding · 0.735
Demonstrates that stronger models are largely insensitive to reflection manipulation
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late similar to arithmetic. · finding · 0.725
Core empirical finding about layer-dependent truth direction emergence across task types.
QwQ-32B accuracy on MMLU Formal Logic stays between 95.5% and 96.3% across all intervention strengths while tokens reduced from 1716.6 to 1481.4 at -0.96 · finding · 0.721
Demonstrates reflection redundancy in larger models on non-mathematical reasoning
After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts · finding · 0.720
Demonstrates Assistant attractor dynamics in practice
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition. · finding · 0.716
Geometric evidence for convergence to stable truth directions only for simpler tasks.
Layer 24 (indexed at 8) of LLaMA3.1-8B on Hinting satisfies Criteria 1 and 2 under both IIT 3.0 and IIT 4.0 (temporal permutation). · finding · 0.714
One of the most promising cases; approximately corresponds to the 2/3 layer of LLaMA3.1-8B.
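
The similarity listing above applies two conditions: a cosine score of at least 0.65 against this entity, and no existing typed edge. A minimal sketch of that filter follows, assuming each entity has an embedding vector and typed edges are stored as unordered ID pairs. The function names, the data layout, and the toy vectors are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_typed_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs above the threshold, highest score first.

    embeddings:  dict mapping entity ID to its vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) for pairs that already share an edge
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data for illustration only.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.1],  # very similar to "a" but already linked by a typed edge
    "c": [0.0, 1.0],  # orthogonal to "a"
    "d": [1.0, 1.0],  # cosine with "a" is about 0.707
}
toy_edges = {frozenset(("a", "b"))}
neighbors = similar_without_typed_edge("a", toy_embeddings, toy_edges)
```

In this toy setup only `"d"` is returned: `"b"` is excluded by its typed edge and `"c"` falls below the threshold.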