finding
active
finding:deepseek-r1-llama-8b-accuracy-on-mmlu-professional-accounting-drops-from-56-5-at-baseline-to-50-1-at-intervention-0-96
DeepSeek-R1 Llama 8b accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention -0.96
Shows smaller models are more sensitive to reflection reduction on non-math tasks
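The size of the drop can be checked directly from the two accuracy figures in the title. The snippet below is a minimal sketch that uses only those two numbers; the variable names are illustrative and not taken from the source paper.

```python
# Accuracy figures as stated in the finding title (percent).
baseline_acc = 56.5      # accuracy at baseline
intervened_acc = 50.1    # accuracy at intervention -0.96

# Absolute drop in percentage points, and drop relative to baseline.
abs_drop_pp = baseline_acc - intervened_acc
rel_drop_pct = 100.0 * abs_drop_pp / baseline_acc

print(f"absolute drop: {abs_drop_pp:.1f} percentage points")
print(f"relative drop: {rel_drop_pct:.1f}% of baseline accuracy")
```

Under these figures the drop is 6.4 percentage points, or about 11.3% of the baseline accuracy.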
Source paper
extracted_from (2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Only model showing marginal benefit from increased reflection, at substantial token cost
- Demonstrates reflection redundancy in larger models on non-mathematical reasoning
- Easy questions (acc > 80%) have average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k · finding · 0.785 · Baseline reflection rate for easy questions, confirming difficulty-reflection correlation
- Shows behavioral pattern of self-correction is trainable in smaller models
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- Core E3 finding validating S as a predictor of anchoring effectiveness
- Core empirical result demonstrating early belief formation in easy tasks
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.751 · Central interpretive claim of the paper, supported by causal ablation and activation evidence
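The `cosine ≥ 0.65` cutoff used for this list can be illustrated with a small sketch. It assumes each entity is represented by an embedding vector; the function names and toy vectors are hypothetical, and the page does not say which embedding model or index actually produces the scores.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def related_by_similarity(query, candidates, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy 2-D embeddings (hypothetical, for illustration only).
toy_query = [1.0, 0.0]
toy_candidates = {
    "near": [1.0, 1.0],   # 45 degrees from the query
    "far": [0.0, 1.0],    # orthogonal to the query
    "same": [2.0, 0.0],   # same direction, different length
}
toy_related = related_by_similarity(toy_query, toy_candidates)
```

Because cosine similarity ignores vector length, `same` scores highest, `near` passes the cutoff, and `far` is filtered out.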