Latent Direction of Reflection

The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.

Neighborhood — ranked by edge-count

paper

framework

SEAL (Steerable Reasoning Calibration)
cites
Prior work using steering vectors to control reflection, motivated by reducing redundant self-reflection in long CoT.

method

Activation Steering
implements
Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.

concept

Reflection direction
related_to
A direction in the model's representation space that governs self-reflection behavior, computed as the mean difference between reflection and non-reflection embeddings.
steering vectors
associated_with
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Refusal Direction in LLMs
analogous_to
Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
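
The mean-difference construction and the additive steering step described in the concept entries above can be sketched as follows. This is a minimal illustration on toy vectors, not the paper's implementation: the function names (`mean_vector`, `reflection_direction`, `steer`), the scaling factor `alpha`, and the use of plain Python lists in place of real model activations are all assumptions made for the example.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equally sized vectors."""
    if not rows:
        raise ValueError("need at least one vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def reflection_direction(
    reflect: Sequence[Sequence[float]],
    non_reflect: Sequence[Sequence[float]],
) -> Vector:
    """Mean of the reflection embeddings minus mean of the non-reflection ones."""
    mu_r = mean_vector(reflect)
    mu_n = mean_vector(non_reflect)
    return [a - b for a, b in zip(mu_r, mu_n)]


def steer(activation: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Add a scaled direction to one activation vector (hypothetical helper).

    A positive alpha pushes along the direction, a negative alpha against it.
    """
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy embeddings (assumed data, two dimensions for readability).
reflect = [[1.0, 2.0], [3.0, 4.0]]
non_reflect = [[0.0, 1.0], [2.0, 1.0]]

direction = reflection_direction(reflect, non_reflect)
steered = steer([0.5, 0.5], direction, alpha=-1.0)
print(direction, steered)
```

In a real model the same addition would be applied to hidden states inside the forward pass; how and at which layers that is done is not specified in these entries.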

artifact

https://github.com/d09942015ntu/unveiling_directions_reflection
about
GitHub repository containing experimental details and code for the paper.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reflection direction extraction · method · 0.813
Computes reflection direction as mean difference between MLP and attention output representations of first tokens in reflection vs. non-reflection steps
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment. · claim · 0.798
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
No Reflection · concept · 0.795
Reflection level where the model is forced to output an answer immediately without revisiting reasoning.
A linear reflection direction exists in reasoning LLMs' latent representation space that governs self-reflection behavior · claim · 0.793
Core claim of ReflCtrl that a single direction captures and controls reflection
Whether conclusions about latent reflection directions generalize to larger LLMs, different architectures, or broader datasets remains to be verified. · question · 0.790
Key limitation and open question about experimental scope.
Latent Reflective Capacity · concept · 0.788
The maximum reflective capacity a model can reach under the right framing; separable from default accessibility
latent reasoning · concept · 0.785
Reasoning approach using learnable hidden embeddings.
Cosine projection on reflection direction · method · 0.784
Feature extraction method computing cosine similarity of hidden representations with reflection direction across all layers
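
The cosine-projection feature in the last entry can be illustrated with a short sketch. It assumes one hidden vector and one direction vector per layer and returns one cosine similarity per layer. The names `cosine` and `layer_projections`, the per-layer pairing, and the convention of returning 0.0 for a zero-norm vector are assumptions for this example and are not taken from the source.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_projections(
    hidden_by_layer: Sequence[Sequence[float]],
    direction_by_layer: Sequence[Sequence[float]],
) -> List[float]:
    """One cosine similarity per layer, pairing each hidden vector with that layer's direction."""
    if len(hidden_by_layer) != len(direction_by_layer):
        raise ValueError("need exactly one direction per layer")
    return [cosine(h, d) for h, d in zip(hidden_by_layer, direction_by_layer)]


# Toy example with three layers and two-dimensional vectors (assumed data).
hidden = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]
directions = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

features = layer_projections(hidden, directions)
print(features)
```

The resulting list has one entry per layer, ranging from -1 (pointing against the direction) to 1 (aligned with it), and could serve as a feature vector for a downstream analysis.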