finding

active

finding:for-simple-factual-tasks-f0-f3-probe-directions-show-a-sharp-geometric-transition-in-middle-layers-with-late-layer-probes-converging-to-high-cosine-similarity-a3-and-f4-f5-show-no-clear-transition

For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition.

This is geometric evidence of convergence to a stable truth direction, but only for the simpler tasks.
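
As an illustration of the comparison this finding rests on, the minimal sketch below computes the pairwise cosine similarity between one probe direction per layer, which is the kind of matrix a layer-by-layer heatmap would display. The function names and the toy vectors are hypothetical and are not taken from the paper; real inputs would be the weight vectors of trained linear probes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_similarity_matrix(probes):
    """probes[i] is the probe direction for layer i; returns an N x N matrix."""
    return [[cosine(p, q) for q in probes] for p in probes]

# Hypothetical toy directions for four layers (illustrative only, not paper data).
toy_probes = [
    [1.0, 0.0, 0.0],   # early layer: unrelated direction
    [0.0, 1.0, 0.0],   # middle layer: direction starts to change
    [0.0, 0.6, 0.8],   # late layer
    [0.0, 0.5, 0.85],  # late layer, nearly parallel to the previous one
]

sim = layer_similarity_matrix(toy_probes)
for row in sim:
    print(" ".join(f"{value:5.2f}" for value in row))
```

In this toy matrix the last two layers are almost parallel while the first layer is orthogonal to the rest, which is the shape a late-layer convergence would take.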

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes.
supports
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. · finding · 0.830
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.810
Shows the passive vs. active divide is more important than the specific wording of instructions.
Layer-wise geometry shows early dip, mid-layer alignment, and late standardization across tasks. · claim · 0.790
Qualitative pattern from E3.
Gemma-3-4B-it shows three-stage layer trajectory and S(ℓ) peak despite scale differences in dr and ρd. · finding · 0.788
E3 backbone generalization finding for Gemma; validates pattern across diverse architectures.
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent. · finding · 0.787
Key improvement in cross-task generalization enabled by explicit instruction framing.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers. · claim · 0.785
Argues against the single-layer analysis approach of prior work.
In early layers, the polarity-dependent direction tP explains ~0.38 of truth-related variance at layer 7 vs ~0.09 for tG; by middle layers tG takes over and tP decays. · finding · 0.785
Variance decomposition showing the disentanglement of polarity from truth across model depth.
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late, similar to arithmetic. · finding · 0.784
Core empirical finding about layer-dependent truth direction emergence across task types.
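
Several entries above report AUROC values, including the inverted case (AUROC ≈ 0). The following minimal sketch, using a hypothetical helper and made-up scores rather than any data from the paper, shows why an AUROC near 0 indicates a probe that separates the classes cleanly but with the wrong sign: negating the scores turns it into an AUROC near 1.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (true, false) pairs ranked correctly.

    labels use 1 for true statements and 0 for false ones; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up probe scores: true statements (label 1) receive the LOWEST scores.
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [1, 1, 0, 0]

inverted = auroc(toy_scores, toy_labels)               # every pair ranked wrongly
flipped = auroc([-s for s in toy_scores], toy_labels)  # same probe, sign flipped

print(inverted, flipped)
```

An AUROC of 0.5 would mean the probe carries no usable signal; a value near 0 means the signal is present but tied to the opposite label, consistent with the polarity interpretation given for that entry.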