finding

active

finding:reflection-direction-features-achieve-auroc-0-772-vs-0-736-for-final-layer-baseline-on-deepseek-llama-8b-on-gsm8k-correctness-prediction

Reflection-direction features achieve AUROC 0.772 vs. 0.736 for the final-layer baseline on deepseek-llama-8b for GSM8k correctness prediction

Supports the claim that uncertainty is encoded in the reflection direction
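
For readers unfamiliar with the metric, the comparison above can be read as follows: each feature set yields one score per question, and AUROC is the probability that a correctly answered question receives a higher score than an incorrectly answered one. The sketch below is a minimal illustration of that computation only. The function name `auroc`, the toy labels, and both score lists are hypothetical and are not the paper's data or probe; how the paper trains its probe is not stated in this record.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = answer was correct, 0 = answer was incorrect.
labels = [1, 1, 1, 0, 0, 0]
# Made-up probe outputs standing in for two feature sets.
scores_reflection_direction = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
scores_final_layer = [0.9, 0.5, 0.3, 0.6, 0.4, 0.2]

auroc_reflection = auroc(scores_reflection_direction, labels)
auroc_final_layer = auroc(scores_final_layer, labels)
print(f"reflection direction: {auroc_reflection:.3f}")
print(f"final layer baseline: {auroc_final_layer:.3f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported 0.772 vs. 0.736 indicates a modest ranking advantage for the reflection-direction features.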

Source paper

extracted_from

ReflCtrl: Controlling LLM Reflection via Representation Engineering

(2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng

Neighborhood — ranked by edge-count

Claims (1)

claim

Model's uncertainty information is encoded in the reflection direction
supports
Interpretive claim from probing experiment showing reflection direction features outperform baseline for uncertainty prediction

Hypotheses (1)

hypothesis

Reasoning LLMs trigger reflection when their internal uncertainty is high
supports
Core hypothesis linking internal uncertainty to self-reflection behavior, tested via probing experiments

Questions (1)

question

When does the model initiate reflection during its reasoning process?
answered_by
First central research question motivating ReflCtrl investigation

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Easy questions (acc > 80%) have average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k · finding · 0.834
Baseline reflection rate for easy questions confirming difficulty-reflection correlation
No Reflection with 'Answer' achieves accuracy .037 on gsm8k_adv for Qwen2.5-3B · finding · 0.819
Baseline accuracy when reflection is suppressed.
Layer 27 (last layer) has largest projection magnitude on the reflection direction among all attention head layers in DeepSeek-R1-Qwen-1.5B · finding · 0.818
Attribution finding suggesting the last layer directly controls reflection keyword generation
Triggered Reflection with 'Alternatively' achieves accuracy .684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.804
Highest single-instruction accuracy result in the paper.
DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens · finding · 0.794
Only model showing marginal benefit from increased reflection, at substantial token cost
Clear accuracy stratification across three reflection levels on cruxeval_o_adv: Triggered (.065/.247) > Intrinsic (.040/.133) > No Reflection (.017/.051) for Qwen2.5-3B/Gemma3-4B-IT · finding · 0.791
Core empirical result validating the three-level reflection framework on code reasoning.
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late, similar to arithmetic. · finding · 0.784
Core empirical finding about layer-dependent truth direction emergence across task types.
Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets · finding · 0.783
Empirical observation about which network layers encode reflection-relevant information.