2025
paper:doi-10-48550-arxiv-2508-16989

Unveiling the Latent Directions of Reflection in Large Language Models

TL;DR

Reflection in LLMs corresponds to a recoverable latent direction in activation space, not merely a behavioral artifact of prompt engineering. Working with Qwen2.5-3B and Gemma3-4B-IT on the adversarial benchmarks gsm8k_adv and cruxeval_o_adv, the paper demonstrates a clear three-way stratification in exact-match accuracy: on gsm8k_adv, triggered reflection (average 0.397 for Qwen2.5-3B, 0.586 for Gemma3-4B-IT) substantially outperforms intrinsic reflection (0.295 / 0.335), which in turn outperforms the no-reflection condition (0.051 / 0.147). The central method is contrastive activation steering: steering vectors are computed as mean activation differences at the appended-instruction token position across reflection level pairs (e.g., µ⁰→² from No Reflection to Triggered Reflection), then used either to rank candidate trigger tokens by cosine similarity or to intervene directly in residual-stream activations at a chosen layer ℓ. Using layer-12 steering vectors, the paper identifies previously unreported tokens such as "However," "Oops," and "Validate" as effective reflection triggers, with cosine similarities up to 0.978 against the canonical steering direction; these tokens achieve accuracy competitive with established triggers like "Wait" and "Alternatively." Crucially, inhibition interventions produce larger accuracy drops than enhancement interventions produce accuracy gains: suppressing reflection is mechanistically easier than inducing it. The paper argues that this implies a concrete adversarial risk: jailbreak-style prefix attacks that prepend high-certainty completions may exploit this asymmetry to disable internal safety-checking, and future defenses should target the identified reflection direction directly.

What to take away

  1. On gsm8k_adv, Gemma3-4B-IT achieves average exact-match accuracy of 0.586 under triggered reflection (Wait/Alternatively/Check), 0.335 under intrinsic reflection ([EOS]/#/%), and 0.147 under no-reflection instructions (Answer/Result/Output), confirming a clean three-way stratification.
  2. Qwen2.5-3B on cruxeval_o_adv shows a similarly ordered but lower-absolute stratification: triggered reflection averages 0.065, intrinsic reflection 0.040, and no reflection 0.017, demonstrating the pattern holds across model families and task types.
  3. Contrastive activation steering vectors (µ^ℓ_{a→b}) are constructed as the mean activation difference at the appended-instruction token position across reflection level pairs, extracted from the output of the ℓ-th attention block added to the residual stream.
  4. Steering vectors derived from µ⁰→² (No Reflection vs. Triggered Reflection) systematically outperform µ¹→² (Intrinsic vs. Triggered Reflection) for discovering new trigger instructions, indicating that the more extreme contrastive pair encodes a stronger latent signal.
  5. At layer ℓ=12, the top-5 instructions selected by µ⁰→² cosine similarity for Gemma3-4B-IT include "Verify" (cosine sim 0.978), "Confirm" (0.964), "Initially" (0.961), "Oops" (0.958), and "Validate" (0.956), achieving average gsm8k_adv accuracy of 0.565 versus 0.526 for the embedding-similarity baseline.
  6. Reflection-inducing latent directions emerge primarily in higher layers (ℓ > 5), consistent with late-stage integration of semantic and reasoning signals rather than shallow lexical processing.
  7. Inhibition steering interventions produce accuracy drops that are systematically larger in magnitude than the accuracy gains produced by enhancement interventions, establishing a mechanistic asymmetry: terminating a reasoning trajectory requires less representational work than initiating error-correction.
  8. A replicable methodology choice: candidate reflection-trigger tokens are drawn from the full Qwen2.5 and Gemma3 tokenizer vocabularies, normalized via NLTK stemming and lemmatization, then ranked by cosine similarity to µ^ℓ_{a→b} on a gsm8k_adv training split and evaluated on a held-out test split.
  9. The input-embedding cosine similarity baseline selects semantically adjacent but non-reflective tokens such as "Await," "ConfigureAwait," and "Unchecked" that fail to improve accuracy, confirming that steering vectors capture a functionally distinct latent dimension not present in surface token similarity.
  10. An open question the paper raises is whether LLMs internally maintain a continuous "consistency score" or probability mass over coherent reasoning trajectories, and whether this quantity is what steering vectors modulate during reflection, a hypothesis the paper leaves unformalized, calling for probabilistic and information-theoretic tools.
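
The vector construction and candidate ranking described in the takeaways can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`mean_vector`, `steering_vector`, `cosine`, `rank_candidates`) are hypothetical, the 2-dimensional "activations" are synthetic numbers invented for the example, and scoring a candidate by its shift away from the level-a mean is an assumption about how the comparison might be set up. In practice the rows would be layer-ℓ activations captured at the appended-instruction token position.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def steering_vector(acts_a, acts_b):
    """mu_{a->b}: mean activation under level b minus mean under level a."""
    mean_a, mean_b = mean_vector(acts_a), mean_vector(acts_b)
    return [b - a for a, b in zip(mean_a, mean_b)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_candidates(candidate_acts, acts_a, mu):
    """Sort candidates by cosine(candidate mean - level-a mean, mu), descending.

    ASSUMPTION: each candidate is scored on its shift away from the level-a
    mean; the source only states that ranking uses cosine similarity to mu.
    """
    base = mean_vector(acts_a)
    scores = {}
    for name, acts in candidate_acts.items():
        shift = [c - b for c, b in zip(mean_vector(acts), base)]
        scores[name] = cosine(shift, mu)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Synthetic 2-d "activations", invented purely for illustration.
acts_none = [[0.0, 1.0], [0.0, 3.0]]   # stand-in for No Reflection (level 0)
acts_trig = [[2.0, 2.0], [4.0, 2.0]]   # stand-in for Triggered Reflection (level 2)
mu_0_to_2 = steering_vector(acts_none, acts_trig)

# Hypothetical candidate instructions with one synthetic activation each.
candidates = {
    "cand_aligned": [[3.0, 2.0]],
    "cand_partial": [[1.0, 3.0]],
    "cand_orthogonal": [[0.0, 5.0]],
}
ranking = rank_candidates(candidates, acts_none, mu_0_to_2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Plain lists keep the sketch dependency-free; with real activations an array library would be the natural choice, and the ranking would be computed on a training split before evaluating the selected tokens on held-out data.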

Peer brief — for seminar discussion

This NeurIPS 2025 Mechanistic Interpretability Workshop paper investigates whether LLM reflection — the capacity to detect errors in a prior chain-of-thought and revise conclusions — is encoded as a recoverable linear direction in activation space, rather than being a purely surface-level response to prompting. The experimental setup uses two open models, Qwen2.5-3B and Gemma3-4B-IT, evaluated on gsm8k_adv and cruxeval_o_adv, adversarial datasets that embed deliberate errors into reasoning steps to force genuine correction rather than answer-level guessing. Reflection is operationalized via three instruction levels appended after a flawed chain-of-thought: No Reflection (tokens like "Answer," "Output"), Intrinsic Reflection (semantically neutral tokens like [EOS], "#"), and Triggered Reflection (established cues like "Wait," "Alternatively," "Check"). Accuracy stratifies cleanly across these levels — on gsm8k_adv, Gemma3-4B-IT averages 0.586 / 0.335 / 0.147 from triggered to no-reflection, and Qwen2.5-3B averages 0.397 / 0.295 / 0.051 — validating the taxonomy before any steering is applied. The load-bearing contribution is the method of contrastive activation steering applied to reflection: steering vectors µ^ℓ_{a→b} are computed as the mean hidden-state difference at the appended-instruction token position between reflection level pairs, extracted at the output of each attention block plus its subsequent MLP. These vectors serve two functions. First, they rank candidate trigger tokens drawn from the full model vocabulary by cosine similarity to µ^ℓ_{0→2}; at layer 12, this recovers non-obvious triggers including "Oops" (cosine similarity 0.958 for Gemma3-4B-IT), "However," and "Validate" — tokens not in any prior reflective prompting catalog — while an input-embedding baseline instead surfaces tokens like "ConfigureAwait" that fail to improve accuracy. 
Second, the vectors enable direct inference-time intervention: adding µ^ℓ_{0→2} to the activations of an "Answer"-prompted model nudges accuracy upward, while subtracting it (that is, applying µ^ℓ_{2→0}) substantially collapses the accuracy of a "Wait"-prompted model. The asymmetry finding, that inhibition produces larger accuracy swings than enhancement, is one of the paper's clearest quantitative claims. The authors argue that this asymmetry has security implications: jailbreak attacks that prepend high-certainty completions (e.g., "Absolutely! Here's") may mechanistically exploit the ease of reflection suppression to bypass safety-checking, while the harder problem of inducing reflection suggests an architectural avenue for defenses. The authors also float a theoretical hypothesis, namely that LLMs implicitly learn a distribution over consistent reasoning trajectories and that erroneous chains are statistical outliers under that distribution, but leave it unformalized. A critical reader would push back on experimental scale: only two models in the 3–4B parameter range are tested, and it remains unclear whether the identified latent direction is a genuine mechanistic encoding of reflective reasoning or merely a surface-level association between instruction token identity and next-token statistics. The paper itself notes that activation patching, causal tracing, and circuit analysis, any of which would have provided stronger mechanistic evidence, were not applied, leaving open whether the steering vector is causally upstream of error-correction behavior or merely correlated with it. The scope is also restricted to situational reflection (correcting another source's chain-of-thought) and does not address self-reflection, limiting generalizability to the broader literature on iterative refinement and long chain-of-thought training.
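
The inference-time intervention described above amounts to shifting a hidden state along the steering direction. The following is a minimal sketch, assuming a scaling coefficient `alpha` (not specified in this summary) and using synthetic 2-dimensional vectors; `steer` and `projection` are hypothetical helper names. It shows only the geometric operation and says nothing about the resulting accuracy changes.

```python
import math

def steer(hidden, mu, alpha=1.0):
    """Return hidden + alpha * mu. alpha > 0 enhances, alpha < 0 inhibits.

    ASSUMPTION: alpha is an illustrative scaling knob, not a value from the paper.
    """
    return [h + alpha * m for h, m in zip(hidden, mu)]

def projection(hidden, mu):
    """Signed length of hidden along the unit-normalised steering direction."""
    norm = math.sqrt(sum(m * m for m in mu))
    return sum(h * m for h, m in zip(hidden, mu)) / norm

# Synthetic values, invented purely for illustration.
mu_0_to_2 = [3.0, 0.0]   # stand-in steering vector (No Reflection -> Triggered)
hidden = [1.0, 2.0]      # stand-in residual-stream activation at layer l

enhanced = steer(hidden, mu_0_to_2, alpha=1.0)    # add mu_{0->2}
inhibited = steer(hidden, mu_0_to_2, alpha=-1.0)  # subtract it, i.e. add mu_{2->0}

for label, vec in [("original", hidden), ("enhanced", enhanced), ("inhibited", inhibited)]:
    print(label, vec, projection(vec, mu_0_to_2))
```

Because the steering vector is a mean difference, reversing the level pair only flips its sign, which is why a single function with a signed coefficient covers both enhancement and inhibition. In a real model the same addition would be applied to the layer-ℓ activations during the forward pass, for example through a hook mechanism provided by the framework in use.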

Datasets and models (4)

  • cruxeval_o_adv
    Code reasoning dataset with adversarially introduced errors used for reflection evaluation.
  • Gemma3-4B-IT
    Small open instruction-tuned LLM used as the second experimental subject.
  • gsm8k_adv
    Math reasoning dataset with adversarially introduced errors used for reflection evaluation.
  • Qwen2.5-3B
    Small open LLM used as one of two experimental subjects for reflection steering experiments.

Original abstract

Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet, most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with different reflective intentions: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward mechanistic understanding of reflective reasoning in LLMs.
