finding
active
finding:easy-questions-acc-80-have-average-reflection-rate-of-25-8-for-deepseek-r1-llama-8b-on-gsm8k
Easy questions (acc > 80%) have an average reflection rate of 25.8% for DeepSeek-R1 Llama 8B on GSM8K
Baseline reflection rate for easy questions, consistent with the difficulty-reflection correlation
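As a rough illustration of how a figure like this could be computed, the sketch below averages per-question reflection rates over questions whose accuracy exceeds 80%. It is not the source paper's procedure: the function name `easy_question_reflection_rate`, the input layout, and the assumption that each sampled response already carries a boolean "contains reflection" flag are all hypothetical.

```python
from statistics import mean


def easy_question_reflection_rate(samples, acc_threshold=0.8):
    """Average reflection rate over questions with accuracy above the threshold.

    `samples` maps a question id to a list of (is_correct, has_reflection)
    pairs, one pair per sampled response. This layout is an assumption made
    for illustration only.
    """
    rates = []
    for runs in samples.values():
        if not runs:
            continue  # skip questions with no sampled responses
        accuracy = mean(1.0 if correct else 0.0 for correct, _ in runs)
        if accuracy > acc_threshold:  # "easy" bucket: acc > 80%
            rates.append(mean(1.0 if refl else 0.0 for _, refl in runs))
    # None signals that no question fell into the easy bucket
    return mean(rates) if rates else None


# Toy data, not taken from the paper
toy = {
    "q1": [(True, False)] * 4 + [(True, True)],  # acc 1.0, reflection 0.2
    "q2": [(True, True), (False, True)],         # acc 0.5, excluded
    "q3": [(True, True), (True, True), (True, False), (True, False)],  # acc 1.0, reflection 0.5
}
result = easy_question_reflection_rate(toy)
```

This version averages per question first and then across questions; pooling all responses before averaging is an equally plausible reading and can give a different number when questions have unequal sample counts.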
Source paper
extracted_from (2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Only model showing marginal benefit from increased reflection, at substantial token cost
- Supports claim that uncertainty is encoded in reflection direction
- Baseline accuracy when reflection is suppressed.
- Shows smaller models are more sensitive to reflection reduction on non-math tasks
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
- Triggered Reflection with 'Alternatively' achieves accuracy 0.684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.764 · Highest single-instruction accuracy result in the paper.
- Replication across open-weight models supports scale-emergence finding