finding

active

finding:attention-heads-with-positive-projection-on-reflection-direction-are-sparse-and-located-mostly-in-deeper-layers-of-deepseek-r1-qwen-1-5b

Attention heads with positive projection on reflection direction are sparse and located mostly in deeper layers of DeepSeek-R1-Qwen-1.5B

Structural finding about which attention heads control reflection behavior
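A minimal sketch of how a per-head projection onto a reflection direction could be tabulated, assuming per-head output vectors have already been extracted and averaged. The function names, array shapes, and toy values are hypothetical illustrations, not the paper's procedure or data.

```python
import numpy as np

def head_projections(head_outputs, direction):
    """Project each head's output onto the unit-normalized direction.

    head_outputs: array of shape (n_layers, n_heads, d_model), assumed to hold
                  one averaged output vector per attention head.
    direction:    array of shape (d_model,), the assumed reflection direction.
    Returns an array of shape (n_layers, n_heads) of signed projections.
    """
    unit = direction / np.linalg.norm(direction)
    return head_outputs @ unit

def positive_head_summary(proj):
    """Return the fraction of heads with positive projection and per-layer counts."""
    positive = proj > 0
    return positive.mean(), positive.sum(axis=1)

# Toy data only: 28 layers (so layer 27 is the last), 12 heads, 16 dimensions.
# These sizes are assumptions chosen for illustration.
n_layers, n_heads, d_model = 28, 12, 16
direction = np.arange(1, d_model + 1, dtype=float)
unit = direction / np.linalg.norm(direction)

# Most heads point slightly against the direction; a few deep heads align with it.
head_outputs = np.tile(-0.1 * unit, (n_layers, n_heads, 1))
for layer, head in [(26, 3), (27, 0), (27, 5)]:
    head_outputs[layer, head] = 2.0 * unit

proj = head_projections(head_outputs, direction)
frac_positive, per_layer_counts = positive_head_summary(proj)
print(f"fraction of heads with positive projection: {frac_positive:.4f}")
print("layers containing positive heads:", np.nonzero(per_layer_counts)[0].tolist())
```

In this toy setup the positive heads are placed by hand in the last two layers, so the summary only demonstrates the bookkeeping; it says nothing about where such heads fall in the real model.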

Source paper

extracted_from

ReflCtrl: Controlling LLM Reflection via Representation Engineering

(2025) · Ge Yan · Chung-En Sun · Tsui-Wei Weng

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Layer 27 (last layer) has largest projection magnitude on the reflection direction among all attention head layers in DeepSeek-R1-Qwen-1.5B · finding · 0.860
Attribution finding suggesting the last layer directly controls reflection keyword generation
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance) · finding · 0.802
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets · finding · 0.799
Empirical observation about which network layers encode reflection-relevant information.
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.785
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
10 out of 12 attention heads in the 12-head one-layer model show significantly positive eigenvalue sums, indicating copying behavior · finding · 0.782
Quantitative result from eigenvalue analysis of expanded OV matrices; confirmed by qualitative inspection
What are the specific attention heads or MLP neurons (circuits) responsible for self-reflection in LLMs? · question · 0.780
Future research question about pinpointing fine-grained mechanistic components of reflection.
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.771
The differential layer performance in Lindsey (2026) is explained by Janus's path combinatorics: different tasks use different path distributions.
Similar superposition phenomena may exist in self-attention layers and similar sparse autoencoder methods may extract useful structure from attention · hypothesis · 0.767
Extension of superposition hypothesis to attention layers as future research direction