finding

active

In early layers the polarity-dependent direction tP dominates: at layer 7 it explains ~0.38 of truth-related variance, versus ~0.09 for the polarity-invariant direction tG. By the middle layers tG takes over and tP decays.

Variance decomposition showing the disentanglement of polarity from truth across model depth.
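
As a rough illustration of what "fraction of variance explained by a direction" can mean, the sketch below computes the share of total variance in a set of centered activation vectors that lies along a unit direction. This is an assumption made for illustration only: the entry does not state how the paper defines truth-related variance, and the function name `variance_fraction`, the toy activations, and the stand-in directions `t_p` and `t_g` are all hypothetical.

```python
import numpy as np

def variance_fraction(acts, direction):
    """Share of the total variance in `acts` that lies along `direction`.

    Assumption: "variance explained" is taken to be the variance of the
    projection onto the unit direction divided by the trace of the
    covariance. The paper's own definition may differ.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    unit = direction / np.linalg.norm(direction)
    projected = centered @ unit
    total = centered.var(axis=0).sum()  # trace of the covariance matrix
    return float(projected.var() / total)

# Toy activations (hypothetical): most spread along axis 0, some along axis 1.
acts = np.array([
    [2.0, 0.0, 0.0],
    [-2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
])
t_p = np.array([1.0, 0.0, 0.0])  # stand-in for the polarity-dependent direction
t_g = np.array([0.0, 1.0, 0.0])  # stand-in for the polarity-invariant direction

frac_p = variance_fraction(acts, t_p)
frac_g = variance_fraction(acts, t_g)
print(f"tP: {frac_p:.2f}, tG: {frac_g:.2f}")  # tP: 0.80, tG: 0.20
```

Repeating this per layer, with that layer's activations and directions, would give a depth profile of the kind the finding describes. For non-orthogonal directions the two fractions overlap and need not sum to a meaningful total.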

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Early-layer truth probes primarily capture sentence polarity rather than truth.
supports
Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
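
The inversion described above can be reproduced in a toy geometric model. The sketch assumes, purely for illustration, that activations carry truth × polarity along one direction and truth alone along another, with the first component stronger (as in early layers). The mean-difference probe, the directions `t_p` and `t_g`, the strengths `a` and `b`, and the synthetic data are all hypothetical and are not the paper's setup; the affirmative and negated sets only play the roles of F0 and F1.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 8, 200

# Hypothetical orthonormal directions standing in for tP and tG.
t_p = np.eye(dim)[0]
t_g = np.eye(dim)[1]

def make_acts(truth, polarity, a=1.0, b=0.3, noise=0.1):
    """Toy activations: tP carries truth * polarity, tG carries truth alone."""
    signal = a * (truth * polarity)[:, None] * t_p + b * truth[:, None] * t_g
    return signal + noise * rng.normal(size=(len(truth), dim))

# Balanced labels: +1 = true statement, -1 = false statement.
truth = np.tile([1, -1], n // 2)
acts_aff = make_acts(truth, polarity=+1)  # affirmative statements
acts_neg = make_acts(truth, polarity=-1)  # negated statements

# Mean-difference probe fitted on affirmative statements only.
mu_true = acts_aff[truth == 1].mean(axis=0)
mu_false = acts_aff[truth == -1].mean(axis=0)
w = mu_true - mu_false
midpoint = (mu_true + mu_false) / 2

def accuracy(acts, labels):
    """Fraction of statements whose probe sign matches the truth label."""
    pred = np.sign((acts - midpoint) @ w)
    return float((pred == labels).mean())

acc_aff = accuracy(acts_aff, truth)
acc_neg = accuracy(acts_neg, truth)
print(f"affirmative: {acc_aff:.2f}, negated: {acc_neg:.2f}")
```

Because `a > b` in this toy model, the probe leans mostly on the tP-like axis, so it scores well above chance on affirmative statements and well below chance on negated ones. Swapping the strengths (`a < b`) removes the inversion, which mirrors the claimed handover to tG in later layers.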

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

At layer 12 (the layer analyzed by Burger et al. 2024), tP and tG explain similar fractions of truth-related variance (~0.33 each). · finding · 0.848
Shows that Burger et al.'s layer choice corresponds to a transitional phase, not a universal property.
Polarity-dependent truth direction (tP) · concept · 0.833
A direction that classifies affirmative statements effectively but inverts for negated variants, dominating in early layers.
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. · claim · 0.822
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.792
Experiment 1 finding localizing where truth can be causally mediated.
Polarity-invariant truth direction (tG) · concept · 0.786
A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition. · finding · 0.785
Geometric evidence for convergence to stable truth directions only for simpler tasks.
Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets · finding · 0.783
Empirical observation about which network layers encode reflection-relevant information.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers. · claim · 0.774
Argues against the single-layer analysis approach of prior work.