finding
active
In early layers, the polarity-dependent direction tP explains ~0.38 of truth-related variance at layer 7 vs ~0.09 for tG; by middle layers tG takes over and tP decays.
Variance decomposition showing the disentanglement of polarity from truth across model depth.
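One way such a decomposition could be computed is sketched below. It is a minimal illustration, not the source paper's procedure. It assumes each activation is modeled as `b + truth * tG + truth * polarity * tP`, with `truth` and `polarity` coded as ±1, fits the two directions by least squares, and reports each term's share of total activation variance. The function name `variance_shares`, the synthetic data, and the choice of denominator are all hypothetical; the paper may normalize differently.

```python
import numpy as np

def variance_shares(acts, truth, polarity):
    """Fit acts ~ b + truth*tG + truth*polarity*tP (assumed model) and
    return the share of total variance attributed to tG and to tP."""
    interaction = truth * polarity
    design = np.column_stack([np.ones(len(truth)), truth, interaction]).astype(float)
    coef, *_ = np.linalg.lstsq(design, acts, rcond=None)
    t_g, t_p = coef[1], coef[2]  # fitted general and polarity-dependent directions
    # Variance contributed by each fitted term, summed over hidden dimensions.
    var_g = np.var(np.outer(truth, t_g), axis=0).sum()
    var_p = np.var(np.outer(interaction, t_p), axis=0).sum()
    total = np.var(acts, axis=0).sum()
    return var_g / total, var_p / total

# Synthetic stand-in for one early layer: balanced labels, tP stronger than tG.
rng = np.random.default_rng(0)
n, d = 400, 16
truth = np.tile([1, 1, -1, -1], n // 4)
polarity = np.tile([1, -1, 1, -1], n // 4)
t_g_true = np.zeros(d)
t_g_true[0] = 1.0
t_p_true = np.zeros(d)
t_p_true[1] = 2.0
acts = (np.outer(truth, t_g_true)
        + np.outer(truth * polarity, t_p_true)
        + 0.1 * rng.standard_normal((n, d)))

share_g, share_p = variance_shares(acts, truth, polarity)
print(f"tG share: {share_g:.2f}, tP share: {share_p:.2f}")
```

Running the same function on activations from each layer in turn would give per-layer curves of the kind the finding describes, under the stated assumptions.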
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows that Burger et al.'s layer choice corresponds to a transitional phase, not a universal property.
- A direction that classifies affirmative statements effectively but inverts for negated variants, dominating in early layers.
- Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.792 · Experiment 1 finding localizing where truth can be causally mediated
- A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
- Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets · finding · 0.783 · Empirical observation about which network layers encode reflection-relevant information.
- Argues against the single-layer analysis approach of prior work.