ReflCtrl: Controlling LLM Reflection via Representation Engineering

By Ge Yan · Chung-En Sun · Tsui-Wei Weng · UC San Diego

DOI 10.48550/arxiv.2512.13979 · arXiv 2512.13979 · OpenAlex W4417465501

Reflection direction · ReflCtrl · Stepwise steering · DeepSeek-R1-Distill-Llama-8B · Reflection redundancy · DeepSeek-R1-Distill-Qwen-14B · Self-reflection · GSM8K · MATH-500 · MMLU · QwQ-32B

TL;DR

ReflCtrl demonstrates that self-reflection in reasoning LLMs is governed by an identifiable direction in latent representation space, and that suppressing this direction via stepwise steering can reduce reasoning token usage by up to 33.6% with negligible accuracy loss. The framework extracts a reflection direction as the difference between the mean MLP and attention output embeddings at reflection-initiating steps and the corresponding means at non-reflection steps, then injects or suppresses this direction only at reasoning step boundaries (tokens matching "\n\n"), avoiding the representation drift that degrades all-token steering. Across QwQ-32B, DeepSeek-R1 Llama 8B, and DeepSeek-R1 Qwen 14B, evaluated on GSM8k, MATH-500, and three MMLU subsets, the stronger models are nearly insensitive to reflection suppression: QwQ-32B loses only 0.34% accuracy on MATH-500 while cutting tokens by 21.0%, and DeepSeek-R1 Qwen 14B loses under 2.3% accuracy on MATH-500 at the maximum suppression setting. A logistic regression classifier trained on reflection-direction projections outperforms final-layer embeddings at predicting answer correctness (AUROC 0.850 versus 0.716 for DeepSeek-R1 Qwen 14B), supporting the claim that uncertainty information is encoded in the reflection direction. The paper argues this implies that self-reflection is triggered by the model's perception of its own uncertainty and that, for capable models, a substantial fraction of reflective steps are computationally redundant, making uncertainty-aware dynamic steering a tractable target for further inference-cost reduction.

What to take away

1. ReflCtrl extracts a reflection direction as the difference between the mean MLP and attention output embeddings at reflection-initiating steps and the corresponding means at non-reflection steps; embeddings are taken at the first token of each reasoning step, with steps delimited by "\n\n".
2. Stepwise steering—applying the direction vector only when the model begins a new thinking step rather than at every generated token—preserves accuracy across intervention strengths where all-token steering degrades performance by more than 5% on GSM8k with DeepSeek-R1 Llama 8B.
3. QwQ-32B loses only 0.14% accuracy on GSM8k and 0.34% on MATH-500 at intervention strength −0.96 while reducing reasoning tokens by 32.4% and 21.0% respectively, demonstrating strong reflection redundancy in large non-distilled models.
4. DeepSeek-R1 Qwen 14B on MMLU Professional Accounting retains 78.5% accuracy at maximum suppression versus 77.8% baseline, saving 33.6% of reasoning tokens, the largest token reduction reported across all experiments.
5. A logistic regression classifier trained on reflection-direction projections across all layers achieves AUROC 0.850 and F1 0.976 for DeepSeek-R1 Qwen 14B on GSM8k, outperforming a final-layer embedding baseline (AUROC 0.716, F1 0.929), supporting the hypothesis that uncertainty is encoded in the reflection direction.
6. DeepSeek-R1 Llama 8B is the only model that benefits meaningfully from increased reflection, gaining 0.92% accuracy on MATH-500 with positive intervention at the cost of approximately 2,000 additional reasoning tokens per question.
7. Self-reflection consumes 25–30% of total reasoning tokens in the models studied, establishing the empirical scale of the inference cost that reflection control targets.
8. Reflection direction attribution to individual attention heads in a DeepSeek-R1 Qwen 1.5B model (28 layers, 12 heads per layer) shows that heads with high positive projection onto the reflection direction are sparse and concentrated in deeper layers, with layer 27 showing the largest projection magnitude.
9. An open question the paper raises is whether uncertainty-aware dynamic steering—adjusting intervention strength per question and per generation step based on internal uncertainty signals—could improve the efficiency-accuracy frontier beyond the fixed-strength results reported here.
10. Reflection direction extraction uses the GSM8k training split for direction computation and applies intervention across all layers except the first and last six, a replicable configuration validated by ablation over skipped-layer counts with intervention strength fixed at λ = −0.48.
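
The mean-difference extraction in takeaway 1 can be illustrated with a minimal sketch. It uses toy 3-dimensional vectors in place of real MLP or attention outputs, and the function names and the boolean labels are hypothetical. The sketch also assumes the direction is computed separately for each layer and output type, which the summary does not spell out.

```python
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [fmean(component) for component in zip(*vectors)]


def reflection_direction(step_embeddings, is_reflection):
    """Mean embedding of reflection steps minus mean embedding of all other steps.

    step_embeddings: one vector per reasoning step, taken at the step's first token.
    is_reflection:   parallel list of booleans (hypothetical keyword-based labels).
    """
    reflect = [e for e, flag in zip(step_embeddings, is_reflection) if flag]
    other = [e for e, flag in zip(step_embeddings, is_reflection) if not flag]
    if not reflect or not other:
        raise ValueError("need at least one step of each kind")
    mu_reflect = mean_vector(reflect)
    mu_other = mean_vector(other)
    return [r - o for r, o in zip(mu_reflect, mu_other)]


# Toy embeddings: two reflection steps followed by two non-reflection steps.
embeddings = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
labels = [True, True, False, False]
direction = reflection_direction(embeddings, labels)
print(direction)  # [2.0, -2.0, 0.0]
```

The third component cancels because both groups share it, which is the point of a mean difference: only what distinguishes reflection steps from the rest survives.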

Peer brief — for seminar discussion

ReflCtrl proposes extracting a "reflection direction" from the internal representations of reasoning LLMs and using it for stepwise activation steering that controls how often a model self-reflects during chain-of-thought generation. The direction is the difference between the mean MLP and attention output embeddings at reflection-initiating steps (identified by keywords such as "Wait" or "Let me think") and the corresponding means at all other steps, taken at the first token of each "\n\n"-delimited reasoning segment. Intervention is applied only at step boundaries, not at every token, and across all transformer layers except the first and last six, a configuration validated by ablation. The work evaluates three open-source reasoning models (DeepSeek-R1 Llama 8B, DeepSeek-R1 Qwen 14B, and QwQ-32B) on GSM8k, MATH-500, and three MMLU subsets.
The load-bearing finding is that reflection is largely redundant in stronger models. QwQ-32B sustains 92.72% accuracy on MATH-500 at maximum suppression (λ = −0.96) compared to its 93.06% baseline, while using 21.0% fewer reasoning tokens; DeepSeek-R1 Qwen 14B saves up to 33.6% of tokens on MMLU Professional Accounting with a sub-1% accuracy delta. A secondary finding connects the reflection direction to model uncertainty: a logistic regression classifier trained on projection values along this direction achieves AUROC 0.850 for DeepSeek-R1 Qwen 14B versus 0.716 for a final-layer embedding baseline, suggesting that internal uncertainty is geometrically encoded in the same direction that governs reflection initiation. This implies self-reflection may be a learned proxy for uncertainty resolution rather than an intrinsically beneficial reasoning operation.
Compared to NoWait, the alternative approach of directly suppressing reflection-marker tokens, ReflCtrl offers a continuous intervention knob and incurs smaller accuracy loss under matched token budgets, because it acts on internal geometry rather than vocabulary. The paper predicts that uncertainty-aware dynamic steering, where intervention strength varies per question and per step according to real-time uncertainty estimates, could push efficiency further than the fixed-strength results reported.
A critical reader would push back on the keyword-based reflection identification. Treating any step containing "Wait" or "Let me think" as a reflection step is a coarse operationalization: it likely misses implicit self-corrections and conflates rhetorical hedges with genuine reconsideration. Because the reflection direction is computed from this noisy labeling, the direction may capture surface-level lexical signatures rather than the underlying cognitive operation. The probing result, that the direction predicts answer correctness better than final-layer embeddings, is consistent with both interpretations, and the paper does not test whether a direction extracted with a stricter or semantically richer annotation scheme would yield qualitatively different steering effects or a stronger uncertainty correlation. This also limits generalizability claims to models with identifiable reflection cues in their surface text.
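
To make the critique concrete, here is a minimal sketch of keyword-based step labeling as the brief describes it. It assumes that a step counts as reflection if it contains one of the two cues named above; the cue list, function names, and example trace are illustrative and not taken from the paper.

```python
# The two cues named in the brief; the paper's full keyword list is not given here.
REFLECTION_CUES = ("Wait", "Let me think")


def split_steps(reasoning):
    """Split a reasoning trace into steps on blank lines ("\n\n")."""
    return [step for step in reasoning.split("\n\n") if step.strip()]


def is_reflection_step(step):
    """Flag a step as reflection if it contains any cue (a coarse, purely lexical test)."""
    return any(cue in step for cue in REFLECTION_CUES)


trace = (
    "First, 12 * 3 = 36.\n\n"
    "Wait, the problem says 13, not 12.\n\n"
    "So 13 * 3 = 39.\n\n"
    "The answer is 39."
)
steps = split_steps(trace)
flags = [is_reflection_step(step) for step in steps]
print(flags)  # [False, True, False, False]

# An implicit self-correction without a cue word is missed by this test.
missed = is_reflection_step("Hmm, that sum looks wrong; redo it.")
print(missed)  # False
```

The last call shows the false-negative case the critique raises: a genuine reconsideration that does not use a listed cue is labeled as a non-reflection step and would therefore be averaged into the wrong group when the direction is computed.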

Methods (1)

Stepwise steering
Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token
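
A minimal sketch of the idea follows. It makes several assumptions: hidden states are plain lists, the tokenizer emits "\n\n" as a single token, and steering means adding strength × direction to the hidden state at that token's position. None of these details are confirmed by the summary, and the function names and the toy strength of −0.5 are illustrative.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction; a negative strength suppresses the direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def stepwise_steer(tokens, hiddens, direction, strength, delimiter="\n\n"):
    """Steer only positions whose token is the step delimiter; copy all others unchanged."""
    steered = []
    for token, hidden in zip(tokens, hiddens):
        if token == delimiter:
            steered.append(steer(hidden, direction, strength))
        else:
            steered.append(list(hidden))
    return steered


tokens = ["So", "x=2", "\n\n", "Wait", ",", "\n\n"]
hiddens = [[1.0, 1.0] for _ in tokens]  # toy 2-dim hidden states
direction = [1.0, 0.0]                  # toy reflection direction
steered = stepwise_steer(tokens, hiddens, direction, strength=-0.5)
print(steered[2], steered[3])  # [0.5, 1.0] [1.0, 1.0]
```

All-token steering would instead call `steer` at every position; restricting it to the delimiter is what keeps the remaining hidden states on their original trajectory.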

Frameworks (1)

ReflCtrl
The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering

Datasets (6)

DeepSeek-R1-Distill-Llama-8B
Distilled 8B Llama-based reasoning model studied as primary experimental subject
DeepSeek-R1-Distill-Qwen-14B
Distilled 14B Qwen-based reasoning model studied in experiments
GSM8K
Grade-school math benchmark; its training split is used to compute the reflection direction, and it also serves as an evaluation set
MATH-500
Harder math benchmark with 500 problems used to evaluate ReflCtrl
MMLU
Multi-subject benchmark; three subsets, including Professional Accounting and Formal Logic, are used to evaluate ReflCtrl on non-mathematical reasoning
QwQ-32B
32B reasoning model, the largest of the three evaluated and the only non-distilled one

Findings (13)

QwQ-32B accuracy on MMLU Formal Logic stays between 95.5% and 96.3% across all intervention strengths, while reasoning tokens fall from 1716.6 to 1481.4 at strength −0.96
Demonstrates reflection redundancy in larger models on non-mathematical reasoning
Layer 27 (last layer) has largest projection magnitude on the reflection direction among all attention head layers in DeepSeek-R1-Qwen-1.5B
Attribution finding suggesting the last layer directly controls reflection keyword generation
Reflection direction features achieve AUROC 0.772 vs. 0.736 for the final-layer baseline on DeepSeek-R1 Llama 8B for GSM8k correctness prediction
Supports claim that uncertainty is encoded in reflection direction
Attention heads with positive projection on reflection direction are sparse and located mostly in deeper layers of DeepSeek-R1-Qwen-1.5B
Structural finding about which attention heads control reflection behavior
DeepSeek-R1 Llama 8B gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at a cost of ~2,000 additional tokens
Only model showing marginal benefit from increased reflection, at substantial token cost
DeepSeek-R1 Llama 8B accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention strength −0.96
Shows smaller models are more sensitive to reflection reduction on non-math tasks
Up to 33.6% reasoning tokens saved on MMLU subsets with stepwise steering while maintaining accuracy in larger models
Maximum token savings achieved by ReflCtrl on non-mathematical general reasoning tasks
Stepwise steering achieves over 5% higher accuracy than all-token intervention at a similar token budget
Key result demonstrating advantage of stepwise over all-token steering strategy
QwQ-32B on MATH-500: 21.0% reasoning token reduction at intervention strength -0.96 with only 0.34% accuracy loss
Demonstrates reflection redundancy in stronger model on harder math benchmark
QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48)
Demonstrates that stronger models are largely insensitive to reflection manipulation
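
The token and accuracy figures above can be cross-checked with a short worked calculation. It assumes that 1716.6 is the unsteered baseline token count for QwQ-32B on MMLU Formal Logic, which the finding implies but does not state; the function name is illustrative.

```python
def relative_saving(baseline_tokens, steered_tokens):
    """Fraction of reasoning tokens saved relative to the baseline."""
    return (baseline_tokens - steered_tokens) / baseline_tokens


# QwQ-32B on MMLU Formal Logic at strength -0.96 (figures from the finding above).
formal_logic_saving = relative_saving(1716.6, 1481.4)
print(f"{formal_logic_saving:.1%}")  # 13.7%

# QwQ-32B on MATH-500: 93.06% baseline accuracy vs. 92.72% at maximum suppression.
math500_accuracy_loss = round(93.06 - 92.72, 2)
print(math500_accuracy_loss)  # 0.34
```

Under that assumption the Formal Logic saving is about 13.7%, noticeably smaller than the 33.6% maximum reported for DeepSeek-R1 Qwen 14B on Professional Accounting.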

Claims (13)

The last layer of the transformer has the largest projection magnitude on the reflection direction, likely because it directly controls generation of reflection keywords
Interpretive claim from attention head attribution analysis in appendix
ReflCtrl is more flexible than NoWait because it allows fine-grained control of the accuracy-cost trade-off, while NoWait can only completely disable reflection
Comparative claim against the NoWait baseline method
The identification of reasoning steps relies on keyword search, which may be model-specific since different models could prefer different reflection cues
Limitation acknowledged regarding generalizability of the reflection identification method
Stepwise steering preserves accuracy while reducing cost, whereas all-token steering causes significant degradation at large intervention strengths
Comparative claim between the two steering strategies
Within each difficulty category, correctness rate is not correlated with reflection rate, suggesting reflection may be redundant
Per-category analysis showing reflection rate does not help within difficulty class
Higher reflection frequency correlates with lower accuracy partly because more reflections are generated on difficult questions
Authors' interpretation of the negative correlation between reflection rate and accuracy observed in Fig. 5
A linear reflection direction exists in reasoning LLMs' latent representation space that governs self-reflection behavior
Core claim of ReflCtrl that a single direction captures and controls reflection
ReflCtrl only works for open-source models and it remains unclear whether it generalizes to SOTA closed-source models
Limitation of representation engineering approach shared with other methods
Developing uncertainty-aware dynamic steering is a promising future direction for improving reflection efficiency
Forward-looking claim connecting uncertainty-reflection hypothesis to practical future work
Performance is best when skipping both the first and last six layers when applying intervention
Empirical configuration finding from ablation study on layer selection
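
The layer-selection rule can be sketched as below. The phrase "first and last six layers" is read here as six layers at each end, which is an assumption; the two parameters let the other reading (only the very first layer plus the last six) be expressed as well. The function name and the 32-layer example are illustrative.

```python
def steered_layers(num_layers, skip_first=6, skip_last=6):
    """Indices of layers that receive the intervention, skipping both ends of the stack."""
    if skip_first < 0 or skip_last < 0:
        raise ValueError("skip counts must be non-negative")
    if skip_first + skip_last >= num_layers:
        raise ValueError("no layers left to steer")
    return list(range(skip_first, num_layers - skip_last))


# Hypothetical 32-layer model, zero-indexed layers.
layers = steered_layers(32)
print(layers[0], layers[-1], len(layers))  # 6 25 20

# Alternative reading: skip only the very first layer and the last six.
alt_layers = steered_layers(32, skip_first=1, skip_last=6)
print(alt_layers[0], alt_layers[-1], len(alt_layers))  # 1 25 25
```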

Hypotheses (2)

The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
Reasoning LLMs trigger reflection when their internal uncertainty is high
Core hypothesis linking internal uncertainty to self-reflection behavior, tested via probing experiments
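
The probing experiment can be illustrated with a self-contained sketch: a logistic regression on per-layer projection values, scored by AUROC. The data here is synthetic and generated under the assumption that incorrect answers have higher projections; it does not reproduce the paper's features, models, or numbers, and all names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.1, epochs=300):
    """Fit logistic regression with plain batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - target
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in: label 1 = correct answer; each feature is the projection
# onto the reflection direction at one of NUM_LAYERS hypothetical layers.
rng = random.Random(0)
NUM_LAYERS = 4
labels = [rng.randint(0, 1) for _ in range(200)]
features = [
    [rng.gauss(-1.0 if t == 1 else 1.0, 1.0) for _ in range(NUM_LAYERS)]
    for t in labels
]

# Train on the first 150 examples, evaluate on the held-out 50.
weights, bias = fit_logistic(features[:150], labels[:150])
test_scores = [sum(w * v for w, v in zip(weights, x)) + bias for x in features[150:]]
probe_auroc = auroc(test_scores, labels[150:])
print(f"held-out AUROC on synthetic data: {probe_auroc:.3f}")
```

In the paper's setup the same kind of probe is compared against one trained on final-layer embeddings; the sketch covers only the projection-feature side of that comparison.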

Questions (6)

Current steering applies fixed strength; dynamic uncertainty-aware steering during inference is an open gap
Research gap identified in limitations/future work section connecting uncertainty findings to practical improvement
The underlying mechanism of self-reflection in reasoning LLMs is not yet well understood
Broad gap motivating the entire paper
Does the ReflCtrl approach generalize to closed-source models such as GPT-4 or Claude?
Open limitation question about broader applicability
What is the underlying mechanism of self-reflection in reasoning LLMs?
Open question motivating the entire paper; identified as not yet well understood
When does the model initiate reflection during its reasoning process?
First central research question motivating ReflCtrl investigation
How does reflection influence the model's reasoning performance?
Second central research question motivating ReflCtrl investigation

Original abstract

Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have achieved strong performance across diverse tasks, including mathematics, coding, and general reasoning. A distinctive ability of these reasoning models is self-reflection: the ability to review and revise previous reasoning steps. While self-reflection enhances reasoning performance, it also increases inference cost. In this work, we study self-reflection through the lens of representation engineering. We segment the model's reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that (1) in many cases reflections are redundant, especially in stronger models (in our experiments, we can save up to 33.6 percent of reasoning tokens while preserving performance), and (2) the model's reflection behavior is highly correlated with an internal uncertainty signal, implying self-reflection may be controlled by the model's uncertainty.
