Cosine projection on reflection direction

Feature extraction method that computes the cosine similarity between hidden representations and the reflection direction at every layer
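A minimal sketch of the per-layer feature, assuming one hidden vector per layer (for example, the representation at a single token position) and a reflection direction that is either shared across layers or given per layer. The function name `cosine_projection_features`, the array shapes, and the `eps` guard are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def cosine_projection_features(hidden_states, directions, eps=1e-8):
    """Cosine similarity between each layer's hidden vector and a direction.

    hidden_states: array of shape (num_layers, d), one vector per layer (assumed).
    directions:    array of shape (d,) for a shared direction, or
                   (num_layers, d) for one direction per layer (assumed).
    Returns an array of shape (num_layers,) with values in [-1, 1].
    """
    h = np.asarray(hidden_states, dtype=float)
    # Broadcast a single shared direction to every layer if needed.
    r = np.broadcast_to(np.asarray(directions, dtype=float), h.shape)
    dots = np.sum(h * r, axis=1)
    norms = np.linalg.norm(h, axis=1) * np.linalg.norm(r, axis=1)
    # eps avoids division by zero for all-zero vectors.
    return dots / np.maximum(norms, eps)

# Toy input: three "layers" in a 2-dimensional space.
toy_hidden = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]])
toy_direction = np.array([1.0, 0.0])
features = cosine_projection_features(toy_hidden, toy_direction)
```

On this toy input the three layers are aligned with, orthogonal to, and opposed to the direction, so the feature vector has one entry per layer and is independent of the hidden-state norms.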

Neighborhood — ranked by edge-count

Concepts (1)

concept

Reflection direction
uses
A direction in the model's representation space that governs self-reflection behavior, computed as the mean difference between reflection and non-reflection embeddings
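The mean-difference construction can be sketched as follows. The function name `reflection_direction`, the input layout (one embedding per row), and the optional unit-norm step are assumptions made for illustration; the source only states that the direction is a mean difference between the two groups of embeddings.

```python
import numpy as np

def reflection_direction(reflect_embs, non_reflect_embs, normalize=True):
    """Mean-difference direction between two groups of embeddings.

    reflect_embs:     array of shape (n_reflect, d)
    non_reflect_embs: array of shape (n_non_reflect, d)
    normalize:        rescale to unit length (an assumption; cosine-based
                      features do not depend on the direction's norm).
    """
    mu_reflect = np.mean(np.asarray(reflect_embs, dtype=float), axis=0)
    mu_non_reflect = np.mean(np.asarray(non_reflect_embs, dtype=float), axis=0)
    direction = mu_reflect - mu_non_reflect
    if normalize:
        norm = np.linalg.norm(direction)
        if norm > 0:
            direction = direction / norm
    return direction

# Toy embeddings in a 2-dimensional space.
toy_reflect = np.array([[2.0, 0.0], [4.0, 0.0]])
toy_non_reflect = np.array([[0.0, 0.0], [0.0, 0.0]])
raw = reflection_direction(toy_reflect, toy_non_reflect, normalize=False)
unit = reflection_direction(toy_reflect, toy_non_reflect)
```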

Methods (1)

method

Logistic regression correctness probe
uses
Logistic regression trained on the GSM8K training set to predict answer correctness from projection features along the reflection direction
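A hedged sketch of such a probe, written as plain gradient descent in NumPy so that it is self-contained. The data below is synthetic and stands in for real per-layer projection features; it is not GSM8K, and the assumed relationship between projection sign and correctness is arbitrary. The helper names, learning rate, step count, and L2 penalty are all illustrative choices, not the source's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_probe(X, y, lr=0.5, steps=2000, l2=1e-3):
    """Fit logistic regression by batch gradient descent.

    X: (n_examples, num_layers) projection features, one column per layer.
    y: (n_examples,) binary labels, 1 = answer correct, 0 = incorrect.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        # Gradient of the mean log-loss plus an L2 penalty on the weights.
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def predict_correct(X, w, b):
    """Predict 1 (correct) when the estimated probability is at least 0.5."""
    return (sigmoid(np.asarray(X, dtype=float) @ w + b) >= 0.5).astype(int)

# Synthetic stand-in data: 200 examples, 4 "layers".
rng = np.random.default_rng(0)
num_layers = 4
y = rng.integers(0, 2, size=200)
# Arbitrary assumption: correct answers have slightly higher projections.
X = rng.normal(0.0, 0.1, size=(200, num_layers)) + 0.3 * (2 * y[:, None] - 1)

w, b = fit_logistic_probe(X, y)
train_accuracy = np.mean(predict_correct(X, w, b) == y)
```

In practice a library implementation such as scikit-learn's `LogisticRegression` would likely replace the hand-written loop, and accuracy would be reported on held-out examples rather than on the training data.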

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reflection direction extraction · method · 0.787
Computes reflection direction as mean difference between MLP and attention output representations of first tokens in reflection vs. non-reflection steps
Latent Direction of Reflection · concept · 0.784
The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
Reflection Symmetry · concept · 0.778
One of four key isometries; reflection across a line (mirror line or axis of reflection).
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.763
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
The last layer of the transformer has the largest projection magnitude on the reflection direction, likely because it directly controls generation of reflection keywords · claim · 0.763
Interpretive claim from attention head attribution analysis in appendix
Cosine Similarity Measurement · method · 0.752
Used to measure alignment between DIM direction and cone basis vectors to assess overlap
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment. · claim · 0.751
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
Self-reflection · concept · 0.750
The ability of reasoning LLMs to review and revise previous reasoning steps during inference