finding

active

finding:qwq-32b-accuracy-on-gsm8k-remains-between-96-36-and-96-50-across-all-intervention-strengths-0-96-to-0-48

QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48)

Demonstrates that stronger models are largely insensitive to reflection manipulation: accuracy varies by only 0.14 percentage points across the full range of intervention strengths

Source paper

extracted_from

ReflCtrl: Controlling LLM Reflection via Representation Engineering

(2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng

Neighborhood — ranked by edge-count

Claims (1)

claim

Reflections are redundant in many cases, especially in stronger models
supports
Key interpretive finding that stronger models can have reflections reduced with minimal accuracy cost

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

QwQ-32B accuracy on MMLU Formal Logic stays between 95.5% and 96.3% across all intervention strengths while tokens reduced from 1716.6 to 1481.4 at -0.96 · finding · 0.831
Demonstrates reflection redundancy in larger models on non-mathematical reasoning
No Reflection with 'Answer' achieves accuracy .037 on gsm8k_adv for Qwen2.5-3B · finding · 0.817
Baseline accuracy when reflection is suppressed.
QwQ-32B on MATH-500: 21.0% reasoning token reduction at intervention strength -0.96 with only 0.34% accuracy loss · finding · 0.815
Demonstrates reflection redundancy in stronger model on harder math benchmark
DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens · finding · 0.808
Only model showing marginal benefit from increased reflection, at substantial token cost
Top-5 instructions by µ(1→2) at ℓ=12 achieve average cosine similarity .9893 and average accuracy .5645 on gsm8k_adv for Gemma3-4B-IT · finding · 0.781
High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
Triggered Reflection with 'Alternatively' achieves accuracy .684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.775
Highest single-instruction accuracy result in the paper.
Middle-to-late layers (39-50) of QwQ-32B show consistently stable and high LAT classification performance across all datasets · finding · 0.756
Layer-wise analysis revealing which network depths best encode strategic deception semantics
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961 · finding · 0.753
Quantifies harness activation failure for weak-tier models vs. strong-tier models
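
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of how such a filter might work is shown below. The function names, the dictionary-of-vectors layout, and the representation of typed edges as unordered pairs are all illustrative assumptions; the record does not say how the embeddings are produced or stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already have a typed relation (e.g. "supports")
    threshold:   0.65 mirrors the cutoff shown in the record
    """
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


# Toy data: "b" is close but already linked, "c" is orthogonal, "d" qualifies.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_edges = {frozenset(("a", "b"))}
candidates = related_by_similarity("a", toy_embeddings, toy_edges)
print(candidates)
```

Entities that pass this kind of filter are only candidates: a high score can mean a missing edge, a duplicate, or simply overlapping vocabulary, as with the last two entries in the list, which concern different topics than reflection control.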