finding

active

finding:easy-questions-acc-80-have-average-reflection-rate-of-25-8-for-deepseek-r1-llama-8b-on-gsm8k

Easy questions (accuracy > 80%) have an average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k

Baseline reflection rate for easy questions, confirming the difficulty-reflection correlation
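A minimal sketch of how a baseline like this could be computed, assuming per-question records with hypothetical `accuracy` and `reflection_rate` fields (both fractions in [0, 1]). The field names, the helper function, and the toy numbers are illustrative assumptions, not taken from the paper, and the paper's exact definition of reflection rate may differ.

```python
from statistics import mean


def average_reflection_rate(records, min_accuracy=0.8):
    """Mean reflection rate over questions whose accuracy exceeds min_accuracy.

    `records` is an iterable of dicts with the hypothetical keys
    'accuracy' and 'reflection_rate', both fractions in [0, 1].
    """
    easy = [r["reflection_rate"] for r in records if r["accuracy"] > min_accuracy]
    if not easy:
        raise ValueError("no questions above the accuracy threshold")
    return mean(easy)


# Toy data, made up for illustration only (not results from the paper).
toy = [
    {"accuracy": 0.95, "reflection_rate": 0.20},
    {"accuracy": 0.85, "reflection_rate": 0.30},
    {"accuracy": 0.40, "reflection_rate": 0.55},  # hard question, excluded
]
easy_rate = average_reflection_rate(toy)
```

Only the two questions above the 80% threshold contribute to `easy_rate`; running the same function over the complementary bucket would give the hard-question rate to compare against.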

Source paper

extracted_from

ReflCtrl: Controlling LLM Reflection via Representation Engineering

(2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions
supports
Hypothesis explaining the negative correlation between reflection rate and accuracy without implying that reflection is harmful

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens · finding · 0.870
Only model showing marginal benefit from increased reflection, at substantial token cost
Reflection direction features achieve AUROC 0.772 vs. 0.736 for final layer baseline on deepseek-llama-8b on GSM8k correctness prediction · finding · 0.834
Supports claim that uncertainty is encoded in reflection direction
No Reflection with 'Answer' achieves accuracy .037 on gsm8k_adv for Qwen2.5-3B · finding · 0.795
Baseline accuracy when reflection is suppressed.
DeepSeek-R1 Llama 8b accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention -0.96 · finding · 0.785
Shows smaller models are more sensitive to reflection reduction on non-math tasks
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.771
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.770
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
Triggered Reflection with 'Alternatively' achieves accuracy .684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.764
Highest single-instruction accuracy result in the paper.
Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap · finding · 0.761
Replication across open-weight models supports scale-emergence finding
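
The "related by similarity" listing above (cosine ≥ 0.65, no typed edge) can be illustrated with a small sketch. It assumes each entity has an embedding vector and that entities already linked by a typed edge are known. The function names, the dictionary layout, and the toy vectors are hypothetical and do not describe how this page is actually generated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_candidates(target, others, typed_ids, threshold=0.65):
    """Return (entity_id, score) pairs at or above threshold, highest score first.

    `others` maps entity ids to embedding vectors; ids in `typed_ids`
    already have a typed edge to the target and are skipped.
    """
    scored = [
        (eid, cosine(target, vec))
        for eid, vec in others.items()
        if eid not in typed_ids
    ]
    kept = [(eid, score) for eid, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings, made up for illustration only.
target_vec = [1.0, 0.0]
neighbors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [1.0, 0.1]}
candidates = untyped_candidates(target_vec, neighbors, typed_ids={"d"})
```

Here "b" falls below the threshold and "d" is skipped because it already has a typed edge, so only "a" and "c" remain as candidates for new edges or duplicate checks.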