finding
active
finding:reflection-inducing-directions-emerge-more-clearly-in-higher-layers-l-5-for-both-models-and-datasets
Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets
Empirical observation about which network layers encode reflection-relevant information.
Source paper
extracted_from · (2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretive claim about the locus of reflection in transformer architecture.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. · claim · 0.815 · Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.814 · Experiment 1 finding localizing where truth can be causally mediated
- Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
- Attribution finding suggesting the last layer directly controls reflection keyword generation
- Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. · claim · 0.800 · Cited finding from Shah et al. contextualizing the training origins of reflection.
- Key limitation and open question about experimental scope.
- Structural finding about which attention heads control reflection behavior
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.