finding

active

finding:deepseek-r1-llama-8b-accuracy-on-mmlu-professional-accounting-drops-from-56-5-at-baseline-to-50-1-at-intervention-0-96

DeepSeek-R1 Llama 8b accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention strength -0.96, a decline of 6.4 percentage points

Suggests that smaller models are more sensitive than larger ones to reflection reduction on non-math tasks
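
The size of the drop can be read two ways: as an absolute difference in percentage points, or relative to the baseline. The short sketch below computes both from the two accuracy figures stated in the finding. The variable names are illustrative and do not come from the source paper.

```python
# Accuracy figures are taken from the finding; names are illustrative only.
baseline_acc = 56.5      # accuracy (%) with no intervention
intervened_acc = 50.1    # accuracy (%) at intervention strength -0.96

# Absolute drop, in percentage points.
absolute_drop = baseline_acc - intervened_acc

# Relative drop, as a percentage of the baseline accuracy.
relative_drop = absolute_drop / baseline_acc * 100

print(f"absolute drop: {absolute_drop:.1f} percentage points")
print(f"relative drop: {relative_drop:.1f}% of baseline accuracy")
```

Under these figures the model loses 6.4 percentage points, which is roughly 11.3% of its baseline accuracy.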

Source paper

extracted_from

ReflCtrl: Controlling LLM Reflection via Representation Engineering

(2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens · finding · 0.834
Only model showing marginal benefit from increased reflection, at substantial token cost
QwQ-32B accuracy on MMLU Formal Logic stays between 95.5% and 96.3% across all intervention strengths while tokens reduced from 1716.6 to 1481.4 at -0.96 · finding · 0.810
Demonstrates reflection redundancy in larger models on non-mathematical reasoning
Easy questions (acc > 80%) have average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k · finding · 0.785
Baseline reflection rate for easy questions confirming difficulty-reflection correlation
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.768
Shows behavioral pattern of self-correction is trainable in smaller models
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.760
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Correlation between layer-wise scores and task accuracy ρ = −0.73 (p < 0.001) on LLaMA · finding · 0.755
Core E3 finding validating S as a predictor of anchoring effectiveness
Model final answer is decodable from activations far earlier in CoT than CoT monitor detects on MMLU recall-based questions for both DeepSeek-R1 671B and GPT-OSS 120B · finding · 0.753
Core empirical result demonstrating early belief formation in easy tasks
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.751
Central interpretive claim of the paper supported by causal ablation and activation evidence
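
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over embedding vectors. This is a minimal illustration under assumptions, not the system's actual implementation: the function names, the toy two-dimensional vectors, and the `typed_neighbors` set are all hypothetical, and only the 0.65 threshold comes from the page.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated
    return dot / (norm_a * norm_b)

def related_by_similarity(query_vec, candidates, typed_neighbors, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the query entity, highest score first."""
    hits = []
    for entity_id, vec in candidates.items():
        if entity_id in typed_neighbors:
            continue  # already linked by a typed relation, so skip it
        score = cosine_similarity(query_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): "a" has a typed edge, "b" is orthogonal, "c" is close.
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
result = related_by_similarity([1.0, 0.0], candidates, typed_neighbors={"a"})
print(result)
```

In the toy data only `"c"` survives: `"a"` is excluded because it already has a typed edge, and `"b"` falls below the threshold.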