claim
active
claim:reflection-is-not-merely-a-behavioral-artifact-of-prompting-but-a-phenomenon-encoded-in-the-model-s-activation-space
Reflection is not merely a behavioral artifact of prompting but a phenomenon encoded in the model's activation space.
Central interpretive claim of the paper, supported by steering vector experiments.
Source paper
extracted_from · (2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan
Neighborhood — ranked by edge-count
Findings (2)
finding
- Core empirical result validating the three-level reflection framework on code reasoning.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
Questions (1)
question
- Second key research question motivating the latent direction analysis.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core claim of ReflCtrl that a single direction captures and controls reflection
- Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction
- Interpretive claim about the locus of reflection in transformer architecture.
- First central research question motivating ReflCtrl investigation
- Applied dual-use conclusion drawn from the paper's findings.
- Empirical interpretation of which reference baseline yields more useful steering vectors.
- Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. (claim · 0.783) Cited finding from Shah et al. contextualizing the training origins of reflection.
- Asks what underlying reality causes the consistent choices.
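
The "related by similarity" criterion above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the tool's actual implementation: the function name `similar_untyped`, the dictionary of embeddings, and the set of typed-edge pairs are all hypothetical, and only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_untyped(query_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: list entities whose cosine similarity to the
    query is at or above the threshold and that share no typed edge with it.

    embeddings:  dict mapping entity id -> embedding vector (assumed layout)
    typed_edges: set of (id, id) pairs that already have a typed relation
    """
    query_vec = embeddings[query_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == query_id:
            continue
        # Skip entities already connected by a typed edge in either direction.
        if (query_id, entity_id) in typed_edges or (entity_id, query_id) in typed_edges:
            continue
        score = cosine(query_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would be the ones to review by hand, either to add a typed edge or to merge as duplicates.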