finding
active
finding:deepseek-r1-llama-8b-gains-0-16-accuracy-on-gsm8k-with-positive-intervention-more-reflections-at-cost-of-2000-additional-tokens
DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens
Only model showing marginal benefit from increased reflection, at substantial token cost
Source paper
extracted_from (2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng
Neighborhood — ranked by edge-count
Claims (1)
claim
- Reflections are redundant in many cases, especially in stronger models · contradicts · supports · Key interpretive finding that stronger models can have reflections reduced with minimal accuracy cost
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Easy questions (acc > 80%) have average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k · finding · 0.870 · Baseline reflection rate for easy questions confirming difficulty-reflection correlation
- Shows smaller models are more sensitive to reflection reduction on non-math tasks
- Demonstrates that stronger models are largely insensitive to reflection manipulation
- Supports claim that uncertainty is encoded in reflection direction
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
- LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses · finding · 0.788 · Validates the LLM-based harm evaluation rubric
- Shows behavioral pattern of self-correction is trainable in smaller models
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
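
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the actual implementation behind this page: the function names, the dictionary-of-vectors layout, and the edge-pair representation are all assumptions made for illustration; only the 0.65 threshold and the "no typed edge" condition come from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """List entities near `target_id` that have no typed edge to it.

    embeddings  -- hypothetical dict: entity id -> embedding vector
    typed_edges -- hypothetical iterable of (source_id, target_id) pairs
    threshold   -- minimum cosine similarity (0.65, as stated on the page)
    """
    # Collect every entity already connected to the target by a typed edge,
    # in either direction, so it can be excluded from the candidates.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    target_vec = embeddings[target_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))

    # Highest similarity first, matching the ordering of scores in the list.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks; anything already connected by a typed relation is excluded regardless of its score.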