finding

active

finding:deepseek-r1-llama-8b-gains-0-16-accuracy-on-gsm8k-with-positive-intervention-more-reflections-at-cost-of-2000-additional-tokens

DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens

Only model showing marginal benefit from increased reflection, at substantial token cost

Source paper

extracted_from

ReflCtrl: Controlling LLM Reflection via Representation Engineering

(2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng

Neighborhood — ranked by edge-count

Claims (1)

claim

Reflections are redundant in many cases, especially in stronger models
contradicts · supports
Key interpretive finding that stronger models can have reflections reduced with minimal accuracy cost

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Easy questions (acc > 80%) have average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k · finding · 0.870
Baseline reflection rate for easy questions confirming difficulty-reflection correlation
DeepSeek-R1 Llama 8b accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention -0.96 · finding · 0.834
Shows smaller models are more sensitive to reflection reduction on non-math tasks
QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48) · finding · 0.808
Demonstrates that stronger models are largely insensitive to reflection manipulation
Reflection direction features achieve AUROC 0.772 vs. 0.736 for final layer baseline on deepseek-llama-8b on GSM8k correctness prediction · finding · 0.794
Supports claim that uncertainty is encoded in reflection direction
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.790
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses · finding · 0.788
Validates the LLM-based harm evaluation rubric
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.785
Shows behavioral pattern of self-correction is trainable in smaller models
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.783
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
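
The "Related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. How this page actually computes its neighborhood is not documented here, so the function names, the dictionary of embeddings, and the edge representation below are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  dict mapping entity id -> embedding vector (assumed layout)
    typed_edges: iterable of (id_a, id_b) pairs that already share a typed
                 relation such as supports/contradicts (assumed layout)
    """
    # Collect every entity that already has a typed edge to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    results = []
    for entity_id, vector in embeddings.items():
        # Skip the target itself and anything already connected by a typed edge.
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(embeddings[target_id], vector)
        if score >= threshold:
            results.append((entity_id, score))

    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by such a filter are the ones worth reviewing as candidates for new edges or as unrecognized duplicates, since they sit close to the target in embedding space yet carry no explicit relation to it.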