paper:doi-10-48550-arxiv-2512-13979
ReflCtrl: Controlling LLM Reflection via Representation Engineering
TL;DR
The paper shows that self-reflection in reasoning LLMs is governed by an identifiable direction in latent representation space, and that suppressing this direction via stepwise steering can reduce reasoning token usage by up to 33.6% with negligible accuracy loss. The framework, ReflCtrl, extracts a reflection direction as the mean difference in MLP and attention output embeddings between reflection-initiating and non-reflection steps, then injects or suppresses this direction only at reasoning step boundaries (tokens matching "\n\n"), avoiding the representation drift that degrades all-token steering. Across QwQ-32B, DeepSeek-R1 Llama 8B, and DeepSeek-R1 Qwen 14B, evaluated on GSM8k, MATH-500, and three MMLU subsets, the stronger models are nearly insensitive to reflection suppression: QwQ-32B loses only 0.34% accuracy on MATH-500 while cutting tokens by 21.0%, and DeepSeek-R1 Qwen 14B loses under 2.3% accuracy on MATH-500 at the maximum suppression setting. A logistic regression classifier trained on reflection-direction projections outperforms final-layer embeddings at predicting answer correctness (AUROC 0.850 versus 0.716 for DeepSeek-R1 Qwen 14B), indicating that uncertainty information is encoded in the reflection direction. The paper argues that self-reflection is therefore triggered by the model's perception of its own uncertainty and that, for capable models, a substantial fraction of reflective steps are computationally redundant, making uncertainty-aware dynamic steering a tractable target for further inference-cost reduction.
What to take away
- 1. ReflCtrl extracts a reflection direction as the mean difference in MLP and attention embeddings between reflection-initiating steps and non-reflection steps, computed from the first token of each reasoning step delimited by "\n\n".
- 2. Stepwise steering—applying the direction vector only when the model begins a new thinking step rather than at every generated token—preserves accuracy across intervention strengths where all-token steering degrades performance by more than 5% on GSM8k with DeepSeek-R1 Llama 8B.
- 3. QwQ-32B loses only 0.14% accuracy on GSM8k and 0.34% on MATH-500 at intervention strength −0.96 while reducing reasoning tokens by 32.4% and 21.0% respectively, demonstrating strong reflection redundancy in large non-distilled models.
- 4. DeepSeek-R1 Qwen 14B on MMLU Professional Accounting retains 78.5% accuracy at maximum suppression versus 77.8% baseline, saving 33.6% of reasoning tokens, the largest token reduction reported across all experiments.
- 5. A logistic regression classifier trained on reflection-direction projections across all layers achieves AUROC 0.850 and F1 0.976 for DeepSeek-R1 Qwen 14B on GSM8k, outperforming a final-layer embedding baseline (AUROC 0.716, F1 0.929), supporting the hypothesis that uncertainty is encoded in the reflection direction.
- 6. DeepSeek-R1 Llama 8B is the only model that benefits meaningfully from increased reflection, gaining 0.92% accuracy on MATH-500 with positive intervention at the cost of approximately 2,000 additional reasoning tokens per question.
- 7. Self-reflection consumes 25–30% of total reasoning tokens in the models studied, establishing the empirical scale of the inference cost that reflection control targets.
- 8. Reflection direction attribution to individual attention heads in a DeepSeek-R1 Qwen 1.5B model (28 layers, 12 heads per layer) shows that heads with high positive projection onto the reflection direction are sparse and concentrated in deeper layers, with layer 27 showing the largest projection magnitude.
- 9. An open question the paper raises is whether uncertainty-aware dynamic steering—adjusting intervention strength per question and per generation step based on internal uncertainty signals—could improve the efficiency-accuracy frontier beyond the fixed-strength results reported here.
- 10. Reflection direction extraction uses the GSM8k training split for direction computation and applies intervention across all layers except the first and last six, a replicable configuration validated by ablation over skipped-layer counts with intervention strength fixed at λ = −0.48.
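The extraction recipe in items 1 and 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the cue list contains only the two example keywords quoted in this summary, the function names (`split_steps`, `is_reflection_step`, `mean_difference_direction`) are hypothetical, and the 2-d vectors are toy stand-ins for the MLP or attention output at the first token of each step.

```python
from typing import List, Sequence

# Assumed cue list: only the two example keywords named in this summary.
REFLECTION_CUES = ("Wait", "Let me think")


def split_steps(reasoning: str) -> List[str]:
    # Reasoning steps are delimited by a blank line.
    return [s.strip() for s in reasoning.split("\n\n") if s.strip()]


def is_reflection_step(step: str) -> bool:
    # Keyword-based labelling: a step counts as reflection if it contains a cue.
    return any(cue in step for cue in REFLECTION_CUES)


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_difference_direction(
    reflect: Sequence[Sequence[float]], other: Sequence[Sequence[float]]
) -> List[float]:
    # Direction = mean(reflection embeddings) - mean(non-reflection embeddings).
    mu_reflect = mean_vector(reflect)
    mu_other = mean_vector(other)
    return [a - b for a, b in zip(mu_reflect, mu_other)]


trace = (
    "Compute 12 * 3 = 36.\n\n"
    "Wait, the question asks for 12 * 4.\n\n"
    "So the answer is 48."
)
steps = split_steps(trace)
labels = [is_reflection_step(s) for s in steps]

# Hypothetical first-token embeddings, one per step.
embeddings = [[0.0, 1.0], [2.0, 3.0], [0.0, 3.0]]
reflect = [e for e, flag in zip(embeddings, labels) if flag]
other = [e for e, flag in zip(embeddings, labels) if not flag]
direction = mean_difference_direction(reflect, other)
print(labels, direction)
```

In a real setting the embeddings would come from a forward pass over GSM8k training traces, with one direction per layer and per component; that plumbing is omitted here.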
Peer brief — for seminar discussion
ReflCtrl proposes extracting a "reflection direction" from the internal representations of reasoning LLMs and using it for stepwise activation steering that controls how often models self-reflect during chain-of-thought generation. The direction is computed as the mean difference in MLP and attention output embeddings between reflection-initiating steps (identified by keywords such as "Wait" or "Let me think") and all other steps, evaluated at the first token of each "\n\n"-delimited reasoning segment. Intervention is then applied only at step boundaries, not at every token, and across all transformer layers except the first and last six, a configuration validated by ablation. The work evaluates three open-source reasoning models (DeepSeek-R1 Llama 8B, DeepSeek-R1 Qwen 14B, and QwQ-32B) on GSM8k, MATH-500, and three MMLU subsets. The load-bearing finding is that reflection is largely redundant in stronger models. QwQ-32B sustains 92.72% accuracy on MATH-500 at maximum suppression (λ = −0.96) compared to its 93.06% baseline, while using 21.0% fewer reasoning tokens; DeepSeek-R1 Qwen 14B saves up to 33.6% of tokens on MMLU Professional Accounting with a sub-1% accuracy delta. A secondary finding connects the reflection direction to model uncertainty: a logistic regression classifier trained on projection values along this direction achieves AUROC 0.850 for DeepSeek-R1 Qwen 14B versus 0.716 for a final-layer embedding baseline, suggesting internal uncertainty is geometrically encoded in the same direction that governs reflection initiation. This implies self-reflection may be a learned proxy for uncertainty resolution rather than an intrinsically beneficial reasoning operation. Compared to NoWait, the alternative approach of directly suppressing reflection-marker tokens, ReflCtrl offers a continuous intervention knob and incurs smaller accuracy loss under matched token budgets, because it acts on internal geometry rather than vocabulary.
The paper predicts that uncertainty-aware dynamic steering, where intervention strength varies per question and per step according to real-time uncertainty estimates, could push efficiency further than the fixed-strength results reported. A critical reader would push back on the keyword-based reflection identification. Treating any step containing "Wait" or "Let me think" as a reflection step is a coarse operationalization: it likely misses implicit self-corrections and conflates rhetorical hedges with genuine reconsideration. Because the reflection direction is computed from this noisy labeling, the direction may be capturing surface-level lexical signatures rather than the underlying cognitive operation. The probing results—that the direction predicts answer correctness better than final-layer embeddings—are consistent with both interpretations, and the paper does not test whether a direction extracted with a stricter or semantically richer annotation scheme would yield qualitatively different steering effects or stronger uncertainty correlation. This also limits generalizability claims to models with identifiable reflection cues in their surface text.
Methods (1)
- Stepwise steering: Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token
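The gating logic that separates stepwise from all-token steering can be illustrated with a toy sketch. Several things here are assumptions rather than details confirmed by this summary: the update is taken to be additive (hidden state plus strength times direction), the intervention is placed on the delimiter token itself, and `add_scaled` and `steer_trace` are hypothetical names. Tokens and hidden states are made-up values.

```python
from typing import List

STEP_DELIMITER = "\n\n"


def add_scaled(hidden: List[float], direction: List[float], strength: float) -> List[float]:
    # Assumed additive update: h + strength * direction.
    return [h + strength * d for h, d in zip(hidden, direction)]


def steer_trace(
    tokens: List[str],
    hiddens: List[List[float]],
    direction: List[float],
    strength: float,
    stepwise: bool = True,
) -> List[List[float]]:
    # stepwise=True: intervene only at step-boundary (delimiter) positions.
    # stepwise=False: intervene at every position (all-token steering).
    steered = []
    for token, hidden in zip(tokens, hiddens):
        at_boundary = token == STEP_DELIMITER
        if at_boundary or not stepwise:
            steered.append(add_scaled(hidden, direction, strength))
        else:
            steered.append(list(hidden))
    return steered


tokens = ["x", "=", "2", "\n\n", "Wait", ",", "\n\n", "So"]
hiddens = [[1.0, 1.0] for _ in tokens]
direction = [1.0, 0.0]
strength = -0.96  # negative strength suppresses reflection

stepwise_out = steer_trace(tokens, hiddens, direction, strength, stepwise=True)
all_token_out = steer_trace(tokens, hiddens, direction, strength, stepwise=False)

changed_stepwise = [i for i, (a, b) in enumerate(zip(hiddens, stepwise_out)) if a != b]
changed_all = [i for i, (a, b) in enumerate(zip(hiddens, all_token_out)) if a != b]
print(changed_stepwise, changed_all)
```

Only two of the eight positions are touched in the stepwise case, which is the intuition behind the reduced representation drift: most of the trace is generated from unmodified activations.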
Frameworks (1)
- ReflCtrl: The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering
Datasets (6)
- DeepSeek-R1-Distill-Llama-8B: Distilled 8B Llama-based reasoning model studied as primary experimental subject
- DeepSeek-R1-Distill-Qwen-14B: Distilled 14B Qwen-based reasoning model studied in experiments
- GSM8K: Grade school math dataset; its training split is used to compute the reflection direction, and it also serves as an evaluation benchmark
- MATH-500: Harder math benchmark with 500 problems used to evaluate ReflCtrl
- MMLU: Benchmark from which three subsets, including Professional Accounting and Formal Logic, are used to evaluate ReflCtrl on non-mathematical reasoning
- QwQ-32B: Non-distilled 32B reasoning model evaluated in experiments
Findings (13)
- QwQ-32B accuracy on MMLU Formal Logic stays between 95.5% and 96.3% across all intervention strengths while reasoning tokens fall from 1716.6 to 1481.4 at intervention strength −0.96
Demonstrates reflection redundancy in larger models on non-mathematical reasoning
- Layer 27 (the last layer) has the largest projection magnitude on the reflection direction among all layers in the attention head analysis of DeepSeek-R1-Qwen-1.5B
Attribution finding suggesting the last layer directly controls reflection keyword generation
- Reflection direction features achieve AUROC 0.772 vs. 0.736 for the final-layer baseline on DeepSeek-R1 Llama 8B for GSM8k correctness prediction
Supports claim that uncertainty is encoded in reflection direction
- Attention heads with positive projection on reflection direction are sparse and located mostly in deeper layers of DeepSeek-R1-Qwen-1.5B
Structural finding about which attention heads control reflection behavior
- DeepSeek-R1 Llama 8B gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at a cost of ~2000 additional tokens
Only model showing marginal benefit from increased reflection, at substantial token cost
- DeepSeek-R1 Llama 8B accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention strength −0.96
Shows smaller models are more sensitive to reflection reduction on non-math tasks
- Up to 33.6% reasoning tokens saved on MMLU subsets with stepwise steering while maintaining accuracy in larger models
Maximum token savings achieved by ReflCtrl on non-mathematical general reasoning tasks
- Stepwise steering achieves over 5% higher accuracy than all-token intervention at a similar token budget
Key result demonstrating advantage of stepwise over all-token steering strategy
- QwQ-32B on MATH-500: 21.0% reasoning token reduction at intervention strength -0.96 with only 0.34% accuracy loss
Demonstrates reflection redundancy in stronger model on harder math benchmark
- QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48)
Demonstrates that stronger models are largely insensitive to reflection manipulation
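The token savings in these findings are relative reductions against the unsteered run. As a worked check using only figures quoted above, the QwQ-32B Formal Logic numbers (1716.6 to 1481.4 tokens) correspond to a saving of about 13.7%, and the MATH-500 accuracies quoted in the peer brief (93.06% and 92.72%) give the 0.34-point loss. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, steered: float) -> float:
    # Fraction of the baseline that was saved.
    return (baseline - steered) / baseline


# QwQ-32B on MMLU Formal Logic, tokens at baseline vs. strength -0.96.
formal_logic_saving = relative_reduction(1716.6, 1481.4)

# QwQ-32B on MATH-500, accuracy at baseline vs. strength -0.96 (percentage points).
math500_accuracy_loss = round(93.06 - 92.72, 2)

print(f"token saving: {formal_logic_saving:.1%}")
print(f"accuracy loss: {math500_accuracy_loss} points")
```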
Claims (13)
- The last layer of the transformer has the largest projection magnitude on the reflection direction, likely because it directly controls generation of reflection keywords
Interpretive claim from attention head attribution analysis in appendix
- ReflCtrl is more flexible than NoWait because it allows fine-grained control of the accuracy-cost trade-off, while NoWait can only completely disable reflection
Comparative claim against the NoWait baseline method
- The identification of reasoning steps relies on keyword search, which may be model-specific since different models could prefer different reflection cues
Limitation acknowledged regarding generalizability of the reflection identification method
- Stepwise steering preserves accuracy while reducing cost, whereas all-token steering causes significant degradation at large intervention strengths
Comparative claim between the two steering strategies
- Within each difficulty category, correctness rate is not correlated with reflection rate, suggesting reflection may be redundant
Per-category analysis showing reflection rate does not help within difficulty class
- Higher reflection frequency correlates with lower accuracy partly because more reflections are generated on difficult questions
Authors' interpretation of the negative correlation between reflection rate and accuracy observed in Fig. 5
- A linear reflection direction exists in reasoning LLMs' latent representation space that governs self-reflection behavior
Core claim of ReflCtrl that a single direction captures and controls reflection
- ReflCtrl only works for open-source models and it remains unclear whether it generalizes to SOTA closed-source models
Limitation of representation engineering approach shared with other methods
- Developing uncertainty-aware dynamic steering is a promising future direction for improving reflection efficiency
Forward-looking claim connecting uncertainty-reflection hypothesis to practical future work
- Performance is best when skipping both the first and last six layers when applying intervention
Empirical configuration finding from ablation study on layer selection
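The layer-selection claim amounts to a simple index filter. The sketch below reads "the first and last six" as six layers skipped at each end; the phrasing could also mean the first layer plus the last six, so both skip counts are left as parameters. `steerable_layers` is a hypothetical helper, and the 28-layer depth is borrowed from the DeepSeek-R1 Qwen 1.5B attribution analysis purely as an example.

```python
from typing import List


def steerable_layers(num_layers: int, skip_first: int = 6, skip_last: int = 6) -> List[int]:
    # Return the 0-indexed layers that receive the intervention.
    if skip_first < 0 or skip_last < 0:
        raise ValueError("skip counts must be non-negative")
    if skip_first + skip_last >= num_layers:
        return []
    return list(range(skip_first, num_layers - skip_last))


layers = steerable_layers(28)
print(layers[0], layers[-1], len(layers))

# Alternative reading of the claim: skip only the first layer and the last six.
layers_alt = steerable_layers(28, skip_first=1, skip_last=6)
```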
Hypotheses (2)
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
- Reasoning LLMs trigger reflection when their internal uncertainty is high
Core hypothesis linking internal uncertainty to self-reflection behavior, tested via probing experiments
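The probing experiment behind the second hypothesis can be sketched as a logistic regression over per-layer projections, scored by AUROC. Everything concrete below is an assumption for illustration: the six samples with two "layer" features are invented and are not data from the paper, the plain gradient-descent fit stands in for whatever solver the authors used, and the score is computed on the training samples only to keep the example short.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    # Batch gradient descent on the mean log-loss.
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    # Probability that a random positive outranks a random negative (ties count half).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical projections onto the reflection direction at two layers.
# Label 1 = answer correct; correct answers are given low projections here.
X = [[-1.0, -0.5], [-0.8, -0.9], [-0.6, -0.2], [0.7, 0.4], [0.9, 1.0], [0.5, 0.8]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
scores = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
print(auroc(scores, y))
```

The toy data is linearly separable by construction, so the probe ranks it perfectly; the reported AUROC values (0.850 and 0.772) come from real model activations where the signal is partial.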
Questions (6)
- Current steering applies fixed strength; dynamic uncertainty-aware steering during inference is an open gap
Research gap identified in limitations/future work section connecting uncertainty findings to practical improvement
- The underlying mechanism of self-reflection in reasoning LLMs is not yet well understood
Broad gap motivating the entire paper
- Does the ReflCtrl approach generalize to closed-source models such as GPT-4 or Claude?
Open limitation question about broader applicability
- What is the underlying mechanism of self-reflection in reasoning LLMs?
Open question motivating the entire paper; identified as not yet well understood
- When does the model initiate reflection during its reasoning process?
First central research question motivating ReflCtrl investigation
- How does reflection influence the model's reasoning performance?
Second central research question motivating ReflCtrl investigation
Original abstract
Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have achieved strong performance across diverse tasks, including mathematics, coding, and general reasoning. A distinctive ability of these reasoning models is self-reflection: the ability to review and revise previous reasoning steps. While self-reflection enhances reasoning performance, it also increases inference cost. In this work, we study self-reflection through the lens of representation engineering. We segment the model's reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that (1) in many cases reflections are redundant, especially in stronger models (in our experiments, we can save up to 33.6 percent of reasoning tokens while preserving performance), and (2) the model's reflection behavior is highly correlated with an internal uncertainty signal, implying self-reflection may be controlled by the model's uncertainty.