2025
paper:doi-10-48550-arxiv-2512-13979

ReflCtrl: Controlling LLM Reflection via Representation Engineering

TL;DR

ReflCtrl demonstrates that self-reflection in reasoning LLMs is governed by an identifiable direction in latent representation space, and that suppressing this direction via stepwise steering can reduce reasoning token usage by up to 33.6% with negligible accuracy loss. The framework extracts a reflection direction as the mean difference in MLP and attention output embeddings between reflection-initiating and non-reflection steps, then injects or suppresses this direction only at reasoning step boundaries (tokens matching "\n\n"), avoiding the representation drift that degrades all-token steering. Across QwQ-32B, DeepSeek-R1 Llama 8B, and DeepSeek-R1 Qwen 14B, evaluated on GSM8k, MATH-500, and three MMLU subsets, the stronger models are nearly insensitive to reflection suppression: QwQ-32B loses only 0.34% accuracy on MATH-500 while cutting tokens by 21.0%, and DeepSeek-R1 Qwen 14B loses under 2.3% accuracy on MATH-500 at the maximum suppression setting. A logistic regression classifier trained on reflection-direction projections outperforms final-layer embeddings at predicting answer correctness (AUROC 0.850 versus 0.716 for DeepSeek-R1 Qwen 14B), indicating that uncertainty information is encoded in the reflection direction. The paper argues this implies self-reflection is triggered by internal uncertainty perception and that, for capable models, a substantial fraction of reflective steps are computationally redundant, making uncertainty-aware dynamic steering a tractable target for further inference-cost reduction.

What to take away

  1. ReflCtrl extracts a reflection direction as the mean difference in MLP and attention embeddings between reflection-initiating steps and non-reflection steps, computed from the first token of each reasoning step delimited by "\n\n".
  2. Stepwise steering, which applies the direction vector only when the model begins a new thinking step rather than at every generated token, preserves accuracy across intervention strengths, whereas all-token steering degrades performance by more than 5% on GSM8k with DeepSeek-R1 Llama 8B.
  3. QwQ-32B loses only 0.14% accuracy on GSM8k and 0.34% on MATH-500 at intervention strength −0.96 while reducing reasoning tokens by 32.4% and 21.0% respectively, demonstrating strong reflection redundancy in large non-distilled models.
  4. DeepSeek-R1 Qwen 14B on MMLU Professional Accounting reaches 78.5% accuracy at maximum suppression versus a 77.8% baseline while saving 33.6% of reasoning tokens, the largest token reduction reported across all experiments.
  5. A logistic regression classifier trained on reflection-direction projections across all layers achieves AUROC 0.850 and F1 0.976 for DeepSeek-R1 Qwen 14B on GSM8k, outperforming a final-layer embedding baseline (AUROC 0.716, F1 0.929) and supporting the hypothesis that uncertainty is encoded in the reflection direction.
  6. DeepSeek-R1 Llama 8B is the only model that benefits meaningfully from increased reflection, gaining 0.92% accuracy on MATH-500 with positive intervention at the cost of approximately 2,000 additional reasoning tokens per question.
  7. Self-reflection consumes 25–30% of total reasoning tokens in the models studied, establishing the empirical scale of the inference cost that reflection control targets.
  8. Attributing the reflection direction to individual attention heads in a DeepSeek-R1 Qwen 1.5B model (28 layers, 12 heads per layer) shows that heads with high positive projection onto the direction are sparse and concentrated in deeper layers, with layer 27 showing the largest projection magnitude.
  9. An open question the paper raises is whether uncertainty-aware dynamic steering, which adjusts intervention strength per question and per generation step based on internal uncertainty signals, could improve the efficiency-accuracy frontier beyond the fixed-strength results reported here.
  10. Reflection direction extraction uses the GSM8k training split, and intervention is applied across all layers except the first and last six, a replicable configuration validated by ablation over skipped-layer counts with intervention strength fixed at λ = −0.48.
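
The extraction step in the takeaways can be illustrated with a minimal sketch. It assumes that steps are labeled by simple substring matching on two cue phrases, and that one embedding per step (the first token's) is already available as a plain list of floats. The function names, the cue list, and the toy two-dimensional embeddings are hypothetical and are not taken from the paper's code.

```python
# Hypothetical sketch of mean-difference direction extraction (stdlib only).
REFLECTION_CUES = ("Wait", "Let me think")  # assumed cue list; the paper's may be longer


def split_steps(reasoning: str) -> list[str]:
    """Split a reasoning trace into steps at blank-line delimiters."""
    return [step for step in reasoning.split("\n\n") if step.strip()]


def is_reflection(step: str) -> bool:
    """Label a step as reflective if it contains any cue phrase (assumed rule)."""
    return any(cue in step for cue in REFLECTION_CUES)


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def reflection_direction(steps: list[str], embeddings: list[list[float]]) -> list[float]:
    """Mean embedding of reflection steps minus mean embedding of all other steps."""
    refl = [e for s, e in zip(steps, embeddings) if is_reflection(s)]
    other = [e for s, e in zip(steps, embeddings) if not is_reflection(s)]
    if not refl or not other:
        raise ValueError("need at least one reflection and one non-reflection step")
    return [r - o for r, o in zip(mean_vector(refl), mean_vector(other))]


# Toy trace with four steps; embeddings are made-up first-token vectors.
trace = (
    "Add 3 and 4 to get 7.\n\n"
    "Wait, the problem says multiply.\n\n"
    "3 times 4 is 12.\n\n"
    "Let me think about the units again."
)
steps = split_steps(trace)
toy_embeddings = [[0.0, 1.0], [2.0, 1.0], [0.0, 3.0], [4.0, 1.0]]
direction = reflection_direction(steps, toy_embeddings)
```

In a real setting this would be repeated per layer and per component (MLP output and attention output), with embeddings collected from the model rather than written by hand.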

Peer brief — for seminar discussion

ReflCtrl proposes extracting a "reflection direction" from the internal representations of reasoning LLMs and using it to perform stepwise activation steering that controls how frequently models self-reflect during chain-of-thought generation. The direction is computed as the mean difference in MLP and attention output embeddings between reflection-initiating steps (identified by keywords such as "Wait" or "Let me think") and all other steps, evaluated at the first token of each "\n\n"-delimited reasoning segment. Intervention is then applied only at step boundaries, not at every token, and across all transformer layers except the first and last six, an ablation-validated configuration. The work evaluates three open-source reasoning models (DeepSeek-R1 Llama 8B, DeepSeek-R1 Qwen 14B, and QwQ-32B) on GSM8k, MATH-500, and three MMLU subsets.
The load-bearing finding is that reflection is largely redundant in stronger models. QwQ-32B sustains 92.72% accuracy on MATH-500 at maximum suppression (λ = −0.96) compared to its 93.06% baseline, while using 21.0% fewer reasoning tokens; DeepSeek-R1 Qwen 14B saves up to 33.6% of tokens on MMLU Professional Accounting with a sub-1% accuracy delta. A secondary finding connects the reflection direction to model uncertainty: a logistic regression classifier trained on projection values along this direction achieves AUROC 0.850 for DeepSeek-R1 Qwen 14B versus 0.716 for a final-layer embedding baseline, suggesting internal uncertainty is geometrically encoded in the same direction that governs reflection initiation. This implies self-reflection may be a learned proxy for uncertainty resolution rather than an intrinsically beneficial reasoning operation.
Compared to NoWait, the alternative approach of directly suppressing reflection-marker tokens, ReflCtrl offers a continuous intervention knob and incurs smaller accuracy loss under matched token budgets, because it acts on internal geometry rather than vocabulary. The paper predicts that uncertainty-aware dynamic steering, where intervention strength varies per question and per step according to real-time uncertainty estimates, could push efficiency further than the fixed-strength results reported.
A critical reader would push back on the keyword-based reflection identification. Treating any step containing "Wait" or "Let me think" as a reflection step is a coarse operationalization: it likely misses implicit self-corrections and conflates rhetorical hedges with genuine reconsideration. Because the reflection direction is computed from this noisy labeling, the direction may be capturing surface-level lexical signatures rather than the underlying cognitive operation. The probing result (the direction predicts answer correctness better than final-layer embeddings) is consistent with both interpretations, and the paper does not test whether a direction extracted with a stricter or semantically richer annotation scheme would yield qualitatively different steering effects or stronger uncertainty correlation. This also limits generalizability claims to models with identifiable reflection cues in their surface text.
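
The probing comparison in the brief can be pictured with a small sketch: project each layer's hidden state onto that layer's reflection direction, feed the projections to a logistic regression, and score it with AUROC. Everything here is an assumption made for illustration: the pure-Python gradient-descent fit stands in for whatever solver the authors used, the helper names are hypothetical, and the four toy feature rows are invented, not results from the paper.

```python
# Hypothetical sketch of a projection-based correctness probe (stdlib only).
import math


def project(hidden: list[float], direction: list[float]) -> float:
    """Scalar projection of a hidden state onto a (non-zero) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain batch gradient descent on the logistic loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def auroc(scores, labels) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy features: one projection per layer (two layers), label 1 = answer correct.
X = [[-1.0, -0.5], [-0.8, -0.2], [0.9, 0.4], [1.2, 0.7]]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)
scores = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in X]
probe_auroc = auroc(scores, y)
```

The final-layer baseline mentioned in the brief would use the same fit and scoring, with the raw final-layer embedding as the feature vector instead of the per-layer projections.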

Methods (1)

  • Stepwise steering
    Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token
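
A minimal sketch of the stepwise rule follows, assuming hidden states are plain lists of floats for a single layer and that the intervention is a simple additive shift h + λ·d applied at delimiter positions. The function names are hypothetical. The defaults for which layers to skip encode one reading of "all layers except the first and last six" (first layer and last six layers); the other reading is available by passing `skip_first=6`.

```python
# Hypothetical sketch of stepwise (delimiter-only) activation steering (stdlib only).
STEP_DELIMITER = "\n\n"


def steered_layers(n_layers: int, skip_first: int = 1, skip_last: int = 6) -> list[int]:
    """Indices of layers that receive the intervention (skip counts are assumptions)."""
    return list(range(skip_first, n_layers - skip_last))


def steer(hidden: list[float], direction: list[float], lam: float) -> list[float]:
    """Shift a hidden state along the direction; lam < 0 suppresses reflection."""
    return [h + lam * d for h, d in zip(hidden, direction)]


def stepwise_steer(tokens, hidden_states, direction, lam):
    """Steer only positions whose token is the step delimiter; copy the rest unchanged."""
    return [
        steer(h, direction, lam) if tok == STEP_DELIMITER else list(h)
        for tok, h in zip(tokens, hidden_states)
    ]


# Toy example: three positions, one of which is a step boundary.
tokens = ["x", "\n\n", "Wait"]
hidden_states = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
steered = stepwise_steer(tokens, hidden_states, direction=[1.0, 0.0], lam=-0.5)
layers = steered_layers(28)
```

All-token steering, the baseline the summary contrasts against, would correspond to applying `steer` at every position rather than only at delimiter positions.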

Frameworks (1)

  • ReflCtrl
    The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering

Datasets and models (6)

  • DeepSeek-R1-Distill-Llama-8B
    Distilled 8B Llama-based reasoning model studied as primary experimental subject
  • DeepSeek-R1-Distill-Qwen-14B
    Distilled 14B Qwen-based reasoning model studied in experiments
  • GSM8K
    Grade school math dataset; its training split is used to compute the reflection direction, and it serves as an evaluation benchmark
  • MATH-500
    Harder math benchmark with 500 problems used to evaluate ReflCtrl
  • MMLU
    Benchmark from which three subsets, including Professional Accounting, are used to evaluate ReflCtrl
  • QwQ-32B
    32B non-distilled reasoning model studied in experiments

Original abstract

Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have achieved strong performance across diverse tasks, including mathematics, coding, and general reasoning. A distinctive ability of these reasoning models is self-reflection: the ability to review and revise previous reasoning steps. While self-reflection enhances reasoning performance, it also increases inference cost. In this work, we study self-reflection through the lens of representation engineering. We segment the model's reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that (1) in many cases reflections are redundant, especially in stronger models (in our experiments, we can save up to 33.6 percent of reasoning tokens while preserving performance), and (2) the model's reflection behavior is highly correlated with an internal uncertainty signal, implying self-reflection may be controlled by the model's uncertainty.
