method
active
method:cosine-projection-on-reflection-direction
Cosine projection on reflection direction
Feature extraction method that computes the cosine similarity between a model's hidden representations and the reflection direction at every layer, yielding one projection feature per layer
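The description above can be illustrated with a minimal sketch. It assumes one hidden-state vector and one reflection-direction vector per layer; the entry does not say whether the direction is shared across layers or which token position is read, so both choices are assumptions here. The names `cosine` and `projection_features` and the toy vectors are hypothetical and used only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def projection_features(hidden_by_layer, direction_by_layer):
    """Return one cosine-similarity feature per layer.

    Assumption: both arguments are lists with one vector per layer,
    and the vectors at a given layer have the same dimensionality.
    """
    if len(hidden_by_layer) != len(direction_by_layer):
        raise ValueError("expected one direction vector per layer")
    return [cosine(h, d) for h, d in zip(hidden_by_layer, direction_by_layer)]

# Toy 3-layer example with 2-dimensional vectors (illustrative values only).
hidden = [[1.0, 0.0], [1.0, 1.0], [0.0, -2.0]]
directions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = projection_features(hidden, directions)
```

Because cosine similarity normalizes by both vector lengths, each feature lies in [-1, 1] and reflects only how well the hidden state aligns with the direction, not the magnitude of the activation. The resulting per-layer feature vector is the kind of input a downstream classifier could consume.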
Neighborhood — ranked by edge-count
Concepts (1)
concept
- A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings
Methods (1)
method
- Logistic regression trained on the GSM8K training set to predict answer correctness from projection features along the reflection direction
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Computes reflection direction as mean difference between MLP and attention output representations of first tokens in reflection vs. non-reflection steps
- The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
- One of four key isometries; reflection across a line (mirror line or axis of reflection).
- Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
- Interpretive claim from attention head attribution analysis in appendix
- Used to measure alignment between DIM direction and cone basis vectors to assess overlap
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
- The ability of reasoning LLMs to review and revise previous reasoning steps during inference