finding

active

finding:reflection-inducing-directions-emerge-more-clearly-in-higher-layers-l-5-for-both-models-and-datasets

Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets

Empirical observation about which network layers encode reflection-relevant information.
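One way to read "emerge more clearly" is as a per-layer separability score. The sketch below is a minimal illustration under stated assumptions, not the source paper's procedure: it uses a difference-of-means direction (a common choice in activation-steering work) and a held-out separation score, and it runs on synthetic activations built so that the signal appears only after layer 5. The helper names (`make_toy_activations`, `diff_of_means_direction`, `separation_score`, `layerwise_scores`) and all constants are hypothetical.

```python
import numpy as np


def make_toy_activations(num_layers=12, n=200, dim=64, seed=0):
    """Synthetic stand-in for hidden states; NOT data from the paper.

    Assumption: the class signal is absent up to layer 5 and grows
    linearly afterwards, purely to mimic the qualitative finding.
    """
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=dim)
    true_dir /= np.linalg.norm(true_dir)
    pos_by_layer, neg_by_layer = [], []
    for layer in range(num_layers):
        strength = 0.5 * max(0, layer - 5)
        pos_by_layer.append(rng.normal(size=(n, dim)) + strength * true_dir)
        neg_by_layer.append(rng.normal(size=(n, dim)))
    return pos_by_layer, neg_by_layer


def diff_of_means_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def separation_score(pos, neg, direction):
    """Gap between projected class means, in pooled standard deviations."""
    p, q = pos @ direction, neg @ direction
    pooled_std = np.sqrt(0.5 * (p.var() + q.var()))
    return (p.mean() - q.mean()) / pooled_std


def layerwise_scores(pos_by_layer, neg_by_layer):
    """Fit a direction on the first half of each layer's samples and
    score it on the second half, so the score is not inflated by overfitting."""
    scores = []
    for pos, neg in zip(pos_by_layer, neg_by_layer):
        half = len(pos) // 2
        direction = diff_of_means_direction(pos[:half], neg[:half])
        scores.append(float(separation_score(pos[half:], neg[half:], direction)))
    return scores


if __name__ == "__main__":
    pos, neg = make_toy_activations()
    for layer, score in enumerate(layerwise_scores(pos, neg)):
        print(f"layer {layer:2d}: separation = {score:.2f}")
```

With real models, `pos_by_layer` and `neg_by_layer` would presumably hold per-layer hidden states for reflective and non-reflective prompts; how those are collected is left open here.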

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (1)

claim

Reflective reasoning requires late-stage integration of semantic and reasoning signals, hence reflection-related directions emerge more clearly in higher network layers.
supports
Interpretive claim about the locus of reflection in transformer architecture.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. · claim · 0.815
Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.814
Experiment 1 finding localizing where truth can be causally mediated
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. · claim · 0.802
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Layer 27 (last layer) has largest projection magnitude on the reflection direction among all attention head layers in DeepSeek-R1-Qwen-1.5B · finding · 0.801
Attribution finding suggesting the last layer directly controls reflection keyword generation
Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. · claim · 0.800
Cited finding from Shah et al. contextualizing the training origins of reflection.
Whether conclusions about latent reflection directions generalize to larger LLMs, different architectures, or broader datasets remains to be verified. · question · 0.799
Key limitation and open question about experimental scope.
Attention heads with positive projection on reflection direction are sparse and located mostly in deeper layers of DeepSeek-R1-Qwen-1.5B · finding · 0.799
Structural finding about which attention heads control reflection behavior
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment. · claim · 0.798
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.