concept
active
concept:latent-direction-of-reflection
Latent Direction of Reflection
The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- Prior work using steering vectors to control reflection, motivated by reducing redundant self-reflection in long CoT.
Methods (1)
method
- Activation Steering · implements · Causal intervention technique: edit the NLA explanation, reconstruct it via AR, and use the difference as a steering vector to manipulate model behavior.
Concepts (3)
concept
- Reflection direction · related_to · A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings.
- steering vectors · associated_with · A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Refusal Direction in LLMs · analogous_to · Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
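The mean-difference computation and the additive steering described in the entries above can be sketched as follows. This is a minimal illustration in plain Python on toy vectors, not the paper's implementation: the function names `reflection_direction` and `steer`, the scaling coefficient `alpha`, and the example values are all assumptions introduced here, and real activations would come from a chosen layer of the model.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def reflection_direction(reflect_acts, non_reflect_acts):
    """Difference of means: mean(reflection) minus mean(non-reflection)."""
    mu_reflect = mean_vector(reflect_acts)
    mu_plain = mean_vector(non_reflect_acts)
    return [r - p for r, p in zip(mu_reflect, mu_plain)]


def steer(activation, direction, alpha):
    """Add alpha * direction to one activation vector.

    alpha is a hypothetical strength knob: a negative value pushes the
    activation away from the direction, a positive value pushes toward it.
    """
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy 3-dimensional "activations" (made-up numbers, for illustration only).
reflect_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
non_reflect_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = reflection_direction(reflect_acts, non_reflect_acts)
steered = steer([1.0, 1.0, 1.0], direction, alpha=-0.5)

print(direction)  # [2.0, 0.0, 0.0]
print(steered)    # [0.0, 1.0, 1.0]
```

Only the first coordinate differs between the two toy groups, so the direction is nonzero only there, and steering with a negative `alpha` moves the activation along that coordinate alone.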
Artifacts (1)
artifact
- GitHub repository containing experimental details and code for the paper.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Computes reflection direction as mean difference between MLP and attention output representations of first tokens in reflection vs. non-reflection steps
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
- Reflection level where the model is forced to output an answer immediately without revisiting reasoning.
- Core claim of ReflCtrl that a single direction captures and controls reflection
- Key limitation and open question about experimental scope.
- The maximum reflective capacity a model can reach under the right framing; separable from default accessibility
- Reasoning approach using learnable hidden embeddings.
- Feature extraction method computing cosine similarity of hidden representations with reflection direction across all layers
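The last entry above, cosine similarity of hidden representations with the reflection direction across layers, can be sketched as below. This is a hedged illustration, not the described method's code: `layer_features` is a hypothetical name, the toy vectors are made up, and pairing each layer with its own direction vector is an assumption, since the entry does not say whether a single direction or one per layer is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_features(hidden_by_layer, direction_by_layer):
    """One cosine-similarity feature per layer (assumes one direction per layer)."""
    return [
        cosine_similarity(h, d)
        for h, d in zip(hidden_by_layer, direction_by_layer)
    ]


# Toy 2-dimensional hidden states for three layers (illustrative values only).
hidden_by_layer = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]
direction_by_layer = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

features = layer_features(hidden_by_layer, direction_by_layer)
print(features)  # [1.0, 0.0, -1.0]
```

Because cosine similarity ignores vector length, the feature reflects only how aligned a hidden state is with the direction, which keeps values comparable across layers whose activation norms differ.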