claim
active
claim:the-model-converges-to-a-more-stable-truth-direction-in-middle-to-late-layers-as-evidenced-by-increasing-cosine-similarity-between-layer-wise-probes
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes.
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
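A minimal sketch of the kind of check this claim rests on, assuming one linear probe weight vector per layer is available. The function names and the synthetic `probes` array are hypothetical and not taken from the source paper; the synthetic data is built so that later layers share a common direction, purely to illustrate how such a heatmap matrix could be computed.

```python
import numpy as np


def probe_cosine_matrix(probes: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between per-layer probe directions.

    probes: array of shape (num_layers, hidden_dim), one weight vector per layer.
    Returns a (num_layers, num_layers) matrix; entry (i, j) compares layers i and j.
    """
    norms = np.linalg.norm(probes, axis=1, keepdims=True)
    unit = probes / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return unit @ unit.T


def adjacent_similarity(sim: np.ndarray) -> np.ndarray:
    """Cosine similarity between each layer's probe and the next layer's probe."""
    return np.diagonal(sim, offset=1)


# Synthetic stand-in for trained probes (an assumption, not real results):
# each layer mixes a shared direction with independent noise, and the
# weight on the shared direction grows linearly with depth.
rng = np.random.default_rng(0)
num_layers, hidden_dim = 12, 64
shared = rng.normal(size=hidden_dim)
mix = np.linspace(0.0, 1.0, num_layers)[:, None]
probes = mix * shared + (1.0 - mix) * rng.normal(size=(num_layers, hidden_dim))

sim = probe_cosine_matrix(probes)   # matrix one would plot as a heatmap
adj = adjacent_similarity(sim)      # layer-to-next-layer similarity profile
```

Under this setup, a rising `adj` profile toward later layers is what "convergence to a stable direction" would look like; with real probes the shape of that profile is an empirical question.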
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Findings (1)
finding
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Variance decomposition showing the disentanglement of polarity from truth across model depth.
- Argues against the single-layer analysis approach of prior work.
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets · finding · 0.802 · Empirical observation about which network layers encode reflection-relevant information.
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.802 · Experiment 1 finding localizing where truth can be causally mediated.
- Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Key limitation of the PRH for non-bijective observations