Reflection direction
concept:reflection-direction · concept · active
A direction in the model's representation space that governs self-reflection behavior, computed as the mean difference between reflection and non-reflection embeddings.
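A minimal sketch of the mean-difference computation follows. It assumes each embedding is a plain list of floats of a shared dimension and that the reflection and non-reflection sets are already labeled. The function names and the optional `normalize` flag are hypothetical illustrations, not taken from the source.

```python
import math
from typing import Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> list[float]:
    """Element-wise mean of a non-empty set of equal-length vectors."""
    if not rows:
        raise ValueError("need at least one vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def reflection_direction(
    reflection: Sequence[Sequence[float]],
    non_reflection: Sequence[Sequence[float]],
    normalize: bool = False,  # hypothetical option: rescale to unit length
) -> list[float]:
    """Mean of reflection embeddings minus mean of non-reflection embeddings."""
    mu_reflect = mean_vector(reflection)
    mu_other = mean_vector(non_reflection)
    direction = [a - b for a, b in zip(mu_reflect, mu_other)]
    if normalize:
        norm = math.sqrt(sum(x * x for x in direction))
        if norm > 0.0:
            direction = [x / norm for x in direction]
    return direction


# Toy 2-dimensional embeddings (illustrative values only).
reflect_emb = [[1.0, 2.0], [3.0, 4.0]]
plain_emb = [[0.0, 1.0], [0.0, 1.0]]
print(reflection_direction(reflect_emb, plain_emb))
```

Normalizing is optional here; an unnormalized direction keeps the scale of the difference, while a unit vector is convenient when the direction is later compared by cosine similarity.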
Neighborhood, ranked by edge count
Papers (1)
Methods (1)
- Feature-extraction method that computes the cosine similarity between hidden representations and the reflection direction at every layer
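The feature-extraction method above can be sketched as one cosine score per layer. This assumes a separate reflection direction for each layer (one plausible reading of "across all layers"), with hidden states and directions given as lists of floats; `layerwise_reflection_features` is a hypothetical name, and returning 0.0 for zero-norm vectors is an assumed convention.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm (assumed convention)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layerwise_reflection_features(
    hidden_by_layer: Sequence[Sequence[float]],
    direction_by_layer: Sequence[Sequence[float]],
) -> list[float]:
    """One feature per layer: cos(hidden state at layer l, reflection direction at layer l)."""
    if len(hidden_by_layer) != len(direction_by_layer):
        raise ValueError("need one reflection direction per layer")
    return [cosine(h, d) for h, d in zip(hidden_by_layer, direction_by_layer)]


# Toy example: a 3-layer model with 2-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
directions = [[1.0, 0.0], [1.0, 0.0], [2.0, 2.0]]
features = layerwise_reflection_features(hidden, directions)
print([round(x, 6) for x in features])
```

The resulting per-layer vector can then be fed to any downstream classifier or analysis; using cosine rather than a raw dot product makes the feature insensitive to the magnitude of each layer's hidden state.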
Concepts (4)
- Latent Direction of Reflection · related_to · The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
- Self-reflection · associated_with · The ability of reasoning LLMs to review and revise previous reasoning steps during inference.
- Latent-Space Representations · associated_with · Substrate on which causal emergence was computed across agent lifetimes; aligned with learning success.
- Internal uncertainty · associated_with · The model's internal representation of uncertainty, hypothesized to trigger self-reflection.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge · Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Computes the reflection direction from the MLP and attention output representations of the first token of each step, as the mean difference between reflection and non-reflection steps.
- Reflection level where the model is forced to output an answer immediately without revisiting reasoning.
- One of four key isometries; reflection across a line (mirror line or axis of reflection).
- A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements.
- The specific form of reflection studied, where a model reflects on reasoning generated by another source.
- Responses that name or describe the observing act without performing it; negatively correlated with high scores.
- Ratio of reflection steps to total reasoning steps, used to quantify reflection behavior.
- Responses that perform the observing act; contrasted with described reflection, and the scorer rewards enacted over described reflection.
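The reflection ratio listed above reduces to a count over labeled reasoning steps. A minimal sketch follows; the source does not say how steps are split or identified as reflection, so the keyword markers and `is_reflection_step` below are purely hypothetical placeholders for whatever labeling procedure is actually used.

```python
from typing import Sequence

# Hypothetical markers, for illustration only; a real labeler may differ entirely.
HYPOTHETICAL_MARKERS = ("wait", "let me double-check", "on second thought")


def is_reflection_step(step: str) -> bool:
    """Hypothetical labeler: a step counts as reflection if it contains a marker phrase."""
    text = step.lower()
    return any(marker in text for marker in HYPOTHETICAL_MARKERS)


def reflection_ratio(steps: Sequence[str]) -> float:
    """Number of reflection steps divided by the total number of reasoning steps."""
    if not steps:
        return 0.0  # assumed convention for an empty trace
    n_reflect = sum(1 for s in steps if is_reflection_step(s))
    return n_reflect / len(steps)


trace = [
    "First, compute 3 * 4 = 12.",
    "Wait, the question asks for 3 + 4.",
    "So the answer is 7.",
]
print(reflection_ratio(trace))
```

Keeping the labeler separate from the ratio makes it easy to swap in a learned classifier without changing how the ratio itself is computed.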