claim

active

claim:the-model-converges-to-a-more-stable-truth-direction-in-middle-to-late-layers-as-evidenced-by-increasing-cosine-similarity-between-layer-wise-probes

The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes.

Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
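
As an illustration of the kind of evidence cited here, the following is a minimal sketch of how pairwise cosine similarity between layer-wise probe directions could be computed and summarized. It is not the paper's code: the probe vectors are synthetic stand-ins, and the names `cosine_matrix`, `mean_offdiag`, `d_model`, and the early/late split at layer 6 are assumptions made purely for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(directions):
    """Pairwise cosine similarities; directions[i] is the probe vector for layer i."""
    return [[cosine(u, v) for v in directions] for u in directions]


def mean_offdiag(matrix, layers):
    """Mean similarity over distinct layer pairs within the given layer range."""
    layers = list(layers)
    vals = [matrix[i][j] for i in layers for j in layers if i != j]
    return sum(vals) / len(vals)


# Synthetic stand-in for trained probes (hypothetical data, not from the paper):
# six "early" layers with unrelated directions, and six "late" layers that are
# all small perturbations of one shared direction.
rng = random.Random(0)
d_model = 64
shared = [rng.gauss(0, 1) for _ in range(d_model)]
early = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(6)]
late = [[s + 0.1 * rng.gauss(0, 1) for s in shared] for _ in range(6)]

sim = cosine_matrix(early + late)  # 12 x 12 matrix, the data behind a heatmap
early_mean = mean_offdiag(sim, range(0, 6))
late_mean = mean_offdiag(sim, range(6, 12))
print(f"mean early-layer similarity: {early_mean:.3f}")
print(f"mean late-layer similarity:  {late_mean:.3f}")
```

In a heatmap of `sim`, convergence of this kind would appear as a bright block among the late layers, in contrast to near-zero similarity among the early ones. With real probes, each row of `directions` would be the weight vector of a probe trained on that layer's activations.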

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Findings (1)

finding

For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition.
supports
Geometric evidence for convergence to stable truth directions only for simpler tasks.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In early layers, the polarity-dependent direction tP explains ~0.38 of truth-related variance at layer 7 vs ~0.09 for tG; by middle layers tG takes over and tP decays. · finding · 0.822
Variance decomposition showing the disentanglement of polarity from truth across model depth.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers. · claim · 0.812
Argues against the single-layer analysis approach of prior work.
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.803
Motivating hypothesis for Section 5's investigation of prompt template effects.
Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets. · finding · 0.802
Empirical observation about which network layers encode reflection-relevant information.
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models. · finding · 0.802
Experiment 1 finding localizing where truth can be causally mediated.
Early-layer truth probes primarily capture sentence polarity rather than truth. · claim · 0.800
Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. · finding · 0.800
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Different models cannot converge to the same representation if they have access to fundamentally different information; convergence is capped by mutual information between input signals. · claim · 0.800
Key limitation of the PRH for non-bijective observations.