Unveiling the Latent Directions of Reflection in Large Language Models (doi:10.48550/arXiv.2508.16989)
TL;DR
Reflection in LLMs corresponds to a recoverable latent direction in activation space, not merely a behavioral artifact of prompt engineering. Working with Qwen2.5-3B and Gemma3-4B-IT on the adversarial benchmarks gsm8k_adv and cruxeval_o_adv, the paper reports a clear three-way stratification in exact-match accuracy: on gsm8k_adv, triggered reflection (average 0.397 for Qwen2.5-3B / 0.586 for Gemma3-4B-IT) substantially outperforms intrinsic reflection (0.295 / 0.335), which in turn outperforms the no-reflection condition (0.051 / 0.147). The central method is contrastive activation steering: steering vectors are computed as mean activation differences at the appended-instruction token position between pairs of reflection levels (e.g., µ⁰→² from No Reflection to Triggered Reflection), then used either to rank candidate trigger tokens by cosine similarity or to intervene directly on residual-stream activations at a chosen layer ℓ. Using layer-12 steering vectors, the paper identifies tokens such as "However," "Oops," and "Validate" as effective reflection triggers, with cosine similarities up to 0.978 against the steering direction and accuracy competitive with established triggers like "Wait" and "Alternatively." Crucially, inhibition interventions produce larger accuracy drops than enhancement interventions produce accuracy gains; that is, suppressing reflection is mechanistically easier than inducing it. The paper argues this implies a concrete adversarial risk: jailbreak-style prefix attacks that supply high-certainty continuations can exploit the asymmetry to disable internal safety-checking, and future defenses should target the identified reflection direction directly.
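The verbal definition above can be written out as a short formula. This is a sketch in assumed notation, not the paper's own typesetting: the prompt set D, the concatenation symbol ⊕, the instruction symbols i_a and i_b, and the scale α are introduced here for illustration only.

```latex
% Assumed notation (not from the source):
%   D            set of prompts, each a question plus a flawed chain-of-thought
%   i_a, i_b     instructions appended for reflection levels a and b
%   h^{\ell}(.)  layer-\ell activation at the appended-instruction token position
%   \alpha       hypothetical positive scale for the intervention
\begin{align*}
\mu^{\ell}_{a \to b}
  &= \frac{1}{|D|} \sum_{x \in D} h^{\ell}(x \oplus i_b)
   - \frac{1}{|D|} \sum_{x \in D} h^{\ell}(x \oplus i_a) \\
\tilde{h}^{\ell} &= h^{\ell} + \alpha \, \mu^{\ell}_{0 \to 2}
  && \text{(enhancement)} \\
\tilde{h}^{\ell} &= h^{\ell} + \alpha \, \mu^{\ell}_{2 \to 0}
   = h^{\ell} - \alpha \, \mu^{\ell}_{0 \to 2}
  && \text{(inhibition)}
\end{align*}
```

Under this definition µ^ℓ_{2→0} is simply the negation of µ^ℓ_{0→2}, which is why inhibition can be read as subtracting the enhancement vector.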
What to take away
- 1. On gsm8k_adv, Gemma3-4B-IT achieves average exact-match accuracy of 0.586 under triggered reflection (Wait/Alternatively/Check), 0.335 under intrinsic reflection ([EOS]/#/%), and 0.147 under no-reflection instructions (Answer/Result/Output), confirming a clean three-way stratification.
- 2. Qwen2.5-3B on cruxeval_o_adv shows a similarly ordered but lower-absolute stratification: triggered reflection averages 0.065, intrinsic reflection 0.040, and no reflection 0.017, demonstrating the pattern holds across model families and task types.
- 3. Contrastive activation steering vectors (µ^ℓ_{a→b}) are constructed as the mean activation difference at the appended-instruction token position across reflection level pairs, extracted from the output of the ℓ-th attention block added to the residual stream.
- 4. Steering vectors derived from µ⁰→² (No Reflection vs. Triggered Reflection) slightly but consistently outperform µ¹→² (Intrinsic vs. Triggered Reflection) at discovering new trigger instructions across datasets and models, indicating that the more extreme contrastive pair encodes a stronger latent signal.
- 5. At layer ℓ=12, the top-5 instructions selected by µ⁰→² cosine similarity for Gemma3-4B-IT include "Verify" (cosine sim 0.978), "Confirm" (0.964), "Initially" (0.961), "Oops" (0.958), and "Validate" (0.956), achieving average gsm8k_adv accuracy of 0.565 versus 0.526 for the embedding-similarity baseline.
- 6. Reflection-inducing latent directions emerge primarily in higher layers (ℓ > 5), consistent with late-stage integration of semantic and reasoning signals rather than shallow lexical processing.
- 7. Inhibition steering interventions produce accuracy drops that are systematically larger in magnitude than the accuracy gains produced by enhancement interventions, establishing a mechanistic asymmetry: terminating a reasoning trajectory requires less representational work than initiating error-correction.
- 8. A replicable methodology choice: candidate reflection-trigger tokens are drawn from the full Qwen2.5 and Gemma3 tokenizer vocabularies, normalized via NLTK stemming and lemmatization, then ranked by cosine similarity to µ^ℓ_{a→b} on a gsm8k_adv training split and evaluated on a held-out test split.
- 9. The input-embedding cosine similarity baseline selects semantically adjacent but non-reflective tokens such as "Await," "ConfigureAwait," and "Unchecked" that fail to improve accuracy, confirming that steering vectors capture a functionally distinct latent dimension not present in surface token similarity.
- 10. An open question the paper raises is whether LLMs internally maintain a continuous "consistency score" or probability mass over coherent reasoning trajectories, and whether this quantity is what steering vectors modulate during reflection — a hypothesis the paper leaves unformalized, calling for probabilistic and information-theoretic tools.
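The construction and ranking steps in the takeaways can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' code: the activations are made-up 3-dimensional vectors, the candidate names are placeholders, and representing each candidate by its own mean activation shift is an assumption, since the summary does not say which candidate vector is compared against µ.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(acts_a, acts_b):
    """mu_{a->b}: mean activation under level b minus mean under level a."""
    mean_a = mean_vector(acts_a)
    mean_b = mean_vector(acts_b)
    return [b - a for a, b in zip(mean_a, mean_b)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_candidates(candidate_vectors, mu):
    """Sort (name, similarity) pairs by cosine similarity to mu, highest first."""
    scored = [(name, cosine(vec, mu)) for name, vec in candidate_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy activations at the appended-instruction position (values are invented).
no_reflection_acts = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]
triggered_acts = [[1.0, 1.0, 0.0], [1.2, 1.0, 0.2]]
mu_0_to_2 = steering_vector(no_reflection_acts, triggered_acts)

# Hypothetical candidates, each represented by an assumed activation shift.
candidates = {
    "cand_reflective": [0.9, 0.0, 0.1],
    "cand_neutral": [0.0, 1.0, 0.0],
    "cand_opposed": [-1.0, 0.0, 0.0],
}
ranking = rank_candidates(candidates, mu_0_to_2)
```

In this toy setup the candidate whose shift points along µ ranks first and the one pointing against it ranks last, which is the behavior the ranking step relies on.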
Peer brief — for seminar discussion
This NeurIPS 2025 Mechanistic Interpretability Workshop paper investigates whether LLM reflection, the capacity to detect errors in a prior chain-of-thought and revise conclusions, is encoded as a recoverable linear direction in activation space rather than being a purely surface-level response to prompting. The experimental setup uses two open models, Qwen2.5-3B and Gemma3-4B-IT, evaluated on gsm8k_adv and cruxeval_o_adv, adversarial datasets that embed deliberate errors into reasoning steps to force genuine correction rather than answer-level guessing. Reflection is operationalized via three instruction levels appended after a flawed chain-of-thought: No Reflection (tokens like "Answer," "Output"), Intrinsic Reflection (semantically neutral tokens like [EOS], "#"), and Triggered Reflection (established cues like "Wait," "Alternatively," "Check"). Accuracy stratifies cleanly across these levels: on gsm8k_adv, Gemma3-4B-IT averages 0.586 / 0.335 / 0.147 from triggered to no-reflection, and Qwen2.5-3B averages 0.397 / 0.295 / 0.051, validating the taxonomy before any steering is applied.
The load-bearing contribution is contrastive activation steering applied to reflection: steering vectors µ^ℓ_{a→b} are computed as the mean hidden-state difference at the appended-instruction token position between reflection level pairs, extracted at the output of each attention block plus its subsequent MLP. These vectors serve two functions. First, they rank candidate trigger tokens drawn from the full model vocabulary by cosine similarity to µ^ℓ_{0→2}; at layer 12, this recovers non-obvious triggers including "Oops" (cosine similarity 0.958 for Gemma3-4B-IT), "However," and "Validate," tokens outside the hand-picked Triggered Reflection set, while an input-embedding baseline instead surfaces tokens like "ConfigureAwait" that fail to improve accuracy. Second, the vectors enable direct inference-time intervention: adding µ^ℓ_{0→2} to the activations of an "Answer"-prompted model nudges accuracy upward, while applying the reverse vector µ^ℓ_{2→0} (equivalently, subtracting µ^ℓ_{0→2}) substantially lowers the accuracy of a "Wait"-prompted model.
The asymmetry finding, that inhibition produces larger accuracy swings than enhancement, is one of the paper's clearest quantitative claims. The authors argue that it has security implications: jailbreak attacks that prepend high-certainty completions (e.g., "Absolutely! Here's") may mechanistically exploit the ease of reflection suppression to bypass safety-checking, while the harder problem of inducing reflection suggests an architectural avenue for defenses. The authors also float a theoretical hypothesis, that LLMs implicitly learn a distribution over consistent reasoning trajectories and that erroneous chains are statistical outliers under that distribution, but leave it unformalized.
A critical reader would push back on experimental scale: only two models in the 3–4B parameter range are tested, and it remains unclear whether the identified latent direction is a genuine mechanistic encoding of reflective reasoning or merely a surface-level association between instruction token identity and next-token statistics. The paper itself notes that activation patching, causal tracing, and circuit analysis, any of which would have provided stronger mechanistic evidence, were not applied, leaving open whether the steering vector is causally upstream of error-correction behavior or merely correlated with it. The scope is also restricted to situational reflection (correcting another source's chain-of-thought) and does not address self-reflection, limiting generalizability to the broader literature on iterative refinement and long chain-of-thought training.
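The inference-time intervention described in the brief amounts to adding a scaled vector to one hidden state. The sketch below shows only that arithmetic on toy lists; the function name `steer`, the `alpha` scale, and the choice to modify a single token position are assumptions made for illustration, since the summary does not specify the intervention strength or which positions are modified.

```python
def steer(hidden_states, mu, position=-1, alpha=1.0):
    """Return a copy of hidden_states with alpha * mu added at one position.

    hidden_states: list of per-token activation vectors at a fixed layer.
    mu:            steering vector of the same width (e.g. mu_{0->2}).
    position:      token index to modify (assumed: the appended instruction).
    alpha:         hypothetical scale; positive enhances, negative inhibits.
    """
    steered = [list(h) for h in hidden_states]  # copy so the input is untouched
    steered[position] = [h + alpha * m for h, m in zip(steered[position], mu)]
    return steered

# Toy layer activations for a two-token prompt (values are invented).
hidden = [[0.5, 0.5], [0.0, 1.0]]
mu_0_to_2 = [1.0, -0.5]

enhanced = steer(hidden, mu_0_to_2, alpha=1.0)    # add mu_{0->2}
inhibited = steer(hidden, mu_0_to_2, alpha=-1.0)  # add mu_{2->0} = -mu_{0->2}
```

In a real model the same update would have to be applied inside the forward pass at layer ℓ, for example through a framework hook; that wiring is deliberately omitted here.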
Methods (7)
- Activation patching: Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
- AutoMeco: Automated benchmarking framework for evaluating LLM meta-cognition, mentioned as related work.
- CogTest: Benchmark for cognitive habits evaluation in LRMs, mentioned as related work.
- Contrastive Steering Vector Construction: Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
- Cosine Similarity Ranking for Instruction Discovery: Method to discover new reflection-inducing instructions by ranking candidate tokens by cosine similarity to steering vectors.
- Exact-Match Accuracy with Flexible Number Extraction: Evaluation metric: proportion of samples with predicted answer exactly matching ground-truth, with flexible number extraction.
- NLTK Stemming and Lemmatization: Used to normalize candidate instruction tokens in the instruction discovery experiment.
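The exact-match metric listed above can be illustrated with a small stdlib-only sketch. The summary does not say what "flexible number extraction" does in detail, so the rule used here (take the last number in the output, ignoring thousands separators) is an assumption, and the function names are hypothetical.

```python
import re

# Assumed extraction rule: optional sign, digits with optional commas,
# optional decimal part. This is a guess at "flexible", not the paper's rule.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_number(text):
    """Return the last number found in text as a float, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def exact_match_accuracy(predictions, golds):
    """Fraction of predictions whose extracted number equals the gold number."""
    correct = 0
    for prediction, gold in zip(predictions, golds):
        value = extract_number(prediction)
        if value is not None and value == float(gold):
            correct += 1
    return correct / len(golds)

# Invented model outputs and gold answers, for illustration only.
predictions = [
    "Wait, the step above is wrong, so the answer is 18.",
    "Total: $1,234.50",
    "I am not sure",
]
golds = ["18", "1234.5", "7"]
accuracy = exact_match_accuracy(predictions, golds)
```

Taking the last number rather than the first is a design choice that favors outputs where the model revises an earlier value before stating its final answer.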
Frameworks (4)
- Activation Addition (ActAdd): Steering method deriving vectors from contrastive prompt pairs and adding to first-token activations.
- Representation Engineering: A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework.
- SEAL (Steerable Reasoning Calibration): Prior work using steering vectors to control reflection, motivated by reducing redundant self-reflection in long CoT.
- Three-Level Reflection Framework: The paper's proposed categorization of reflection into No Reflection, Intrinsic Reflection, and Triggered Reflection.
Datasets and models (4)
- cruxeval_o_adv: Code reasoning dataset with adversarially introduced errors used for reflection evaluation.
- Gemma3-4B-IT: Small open instruction-tuned LLM used as the second experimental subject.
- gsm8k_adv: Math reasoning dataset with adversarially introduced errors used for reflection evaluation.
- Qwen2.5-3B: Small open LLM used as one of two experimental subjects for reflection steering experiments.
Findings (12)
- Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
- Clear accuracy stratification across three reflection levels on cruxeval_o_adv: Triggered (.065/.247) > Intrinsic (.040/.133) > No Reflection (.017/.051) for Qwen2.5-3B/Gemma3-4B-IT
Core empirical result validating the three-level reflection framework on code reasoning.
- Input embedding similarity baseline selects semantically related but non-reflective tokens (e.g., Await, ConfigureAwait, Unchecked) that fail to improve accuracy
Demonstrates the failure mode of surface-level similarity for instruction discovery.
- Enhancement steering consistently underperforms compared to directly providing explicit reflection instructions across all tested conditions
Shows that activation steering does not fully replicate mechanisms triggered by explicit prompting.
- Inhibition steering produces larger accuracy drops than enhancement steering produces accuracy gains, across all models and datasets tested
Key asymmetry finding: suppressing reflection is easier than inducing it.
- Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
- Top-5 instructions by µ(1→2) at ℓ=12 achieve average cosine similarity .9893 and average accuracy .5645 on gsm8k_adv for Gemma3-4B-IT
High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
- Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
- Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets
Empirical observation about which network layers encode reflection-relevant information.
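As a quick arithmetic check on the stratification result, the averages quoted in this summary can be tabulated and the gaps between adjacent reflection levels computed. The numbers are copied from the text above; the helper name and the rounding to three decimals are choices made here for illustration.

```python
# Average exact-match accuracy as quoted in this summary, in the order
# (triggered, intrinsic, no reflection).
reported = {
    ("Qwen2.5-3B", "gsm8k_adv"): (0.397, 0.295, 0.051),
    ("Gemma3-4B-IT", "gsm8k_adv"): (0.586, 0.335, 0.147),
    ("Qwen2.5-3B", "cruxeval_o_adv"): (0.065, 0.040, 0.017),
    ("Gemma3-4B-IT", "cruxeval_o_adv"): (0.247, 0.133, 0.051),
}

def stratification_gaps(levels):
    """Return (triggered - intrinsic, intrinsic - no reflection), rounded."""
    triggered, intrinsic, no_reflection = levels
    return (round(triggered - intrinsic, 3), round(intrinsic - no_reflection, 3))

gaps = {key: stratification_gaps(levels) for key, levels in reported.items()}

# True when every model/dataset pair is ordered triggered > intrinsic > none.
is_stratified = all(t > i > n for t, i, n in reported.values())
```

The ordering holds for all four model and dataset pairs, although the absolute gaps on cruxeval_o_adv are far smaller than on gsm8k_adv.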
Claims (12)
- Suppressing reflection is considerably easier than inducing it, because inhibition requires the model to terminate reasoning while enhancement demands additional cognitive effort to re-examine reasoning trajectories.
Key asymmetry finding interpreted mechanistically by the authors.
- Activation steering of reflection has dual-use implications: it can enhance reflection as a defense mechanism, but malicious actors may also use it to inhibit reflection to facilitate jailbreaks.
Applied dual-use conclusion drawn from the paper's findings.
- Prompt-based jailbreak attacks effectively disable internal security-checking mechanisms by appending high-certainty leading prefixes that suppress reflection and deliberation.
Connection between reflection inhibition and jailbreak attack mechanisms.
- Reflective reasoning requires late-stage integration of semantic and reasoning signals, hence reflection-related directions emerge more clearly in higher network layers.
Interpretive claim about the locus of reflection in transformer architecture.
- The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards.
Applied security implication derived from the asymmetry finding.
- Contrasting No Reflection with Triggered Reflection (µ(0→2)) provides a stronger reflection signal than contrasting Intrinsic with Triggered Reflection (µ(1→2)).
Empirical interpretation of which reference baseline yields more useful steering vectors.
- Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment.
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
- Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction.
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity.
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design.
Core applied contribution claim, supported by top-k accuracy comparisons.
Hypotheses (1)
- LLMs implicitly learn a distribution of 'consistent reasoning paths', and inconsistent reasoning forms statistical outliers with low probability under this distribution.
Theoretical hypothesis about the mechanism underlying LLM error detection and reflection.
Questions (6)
- Does the model internally maintain a form of 'consistency score' or probability mass over coherent reasoning trajectories, and how is this score modulated during reflection?
Promising future research direction about the internal mechanism of error detection.
- Whether conclusions about latent reflection directions generalize to larger LLMs, different architectures, or broader datasets remains to be verified.
Key limitation and open question about experimental scope.
- Do effective trigger instructions correspond to latent directions in the hidden space that implicitly induce the self-reflection process?
Second key research question motivating the latent direction analysis.
- How can we systematically identify effective reflection trigger instructions, rather than relying on trial-and-error?
First key research question motivating the methodology.
- How can non-linear reflection dynamics be formalized using probabilistic modeling and information theory?
Theoretical open question about the mathematical treatment of reflection mechanisms.
- What are the specific attention heads or MLP neurons (circuits) responsible for self-reflection in LLMs?
Future research question about pinpointing fine-grained mechanistic components of reflection.
Original abstract
Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet, most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with different reflective intentions: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward mechanistic understanding of reflective reasoning in LLMs.