claim

active

claim:early-layer-truth-probes-primarily-capture-sentence-polarity-rather-than-truth

Early-layer truth probes primarily capture sentence polarity rather than truth.

Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Findings (2)

finding

F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
supports
Demonstrates that early-layer probes capture sentence polarity rather than truth.
In early layers, the polarity-dependent direction t_P explains ~0.38 of truth-related variance at layer 7 vs ~0.09 for t_G; by middle layers t_G takes over and t_P decays.
supports
Variance decomposition showing the disentanglement of polarity from truth across model depth.
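
The inversion described in the first finding can be illustrated with a small synthetic sketch. It does not use the paper's data, models, or probing method. It assumes a difference-of-means probe and a toy 2-D "activation" whose first dimension is a surface feature that tracks truth on affirmative statements and flips sign under negation. All names (`make_split`, `mean_diff_direction`, `auroc_f0`, `auroc_f1`) are hypothetical.

```python
import random


def auroc(scores, labels):
    """Probability that a random true item outscores a random false item (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_split(n, flip, rng, noise=0.3):
    """Toy activations (an assumption, not real model states).

    Dimension 0 is +1 for true and -1 for false statements, with the sign
    reversed when flip=True (standing in for negated statements).
    Dimension 1 is pure noise.
    """
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # alternate false / true labels
        sign = (1 if y == 1 else -1) * (-1 if flip else 1)
        xs.append([sign + rng.gauss(0, noise), rng.gauss(0, noise)])
        ys.append(y)
    return xs, ys


def mean_diff_direction(xs, ys):
    """Difference-of-means probe direction: mean(true) - mean(false)."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return [
        sum(p[d] for p in pos) / len(pos) - sum(q[d] for q in neg) / len(neg)
        for d in range(len(xs[0]))
    ]


def project(xs, w):
    """Score each activation by its dot product with the probe direction."""
    return [sum(a * b for a, b in zip(x, w)) for x in xs]


rng = random.Random(0)
f0_x, f0_y = make_split(200, flip=False, rng=rng)  # affirmative-like split
f1_x, f1_y = make_split(200, flip=True, rng=rng)   # negated-like split

w = mean_diff_direction(f0_x, f0_y)  # probe fitted on the affirmative split only
auroc_f0 = auroc(project(f0_x, w), f0_y)
auroc_f1 = auroc(project(f1_x, w), f1_y)
print(f"AUROC on affirmative split: {auroc_f0:.3f}")
print(f"AUROC on negated split:     {auroc_f1:.3f}")
```

In this toy setup, an AUROC near 0 on the negated split means the probe still separates the two classes, but with the labels reversed. That is the signature of a direction that follows polarity rather than truth. An AUROC near 0.5 would instead mean the probe carries no usable signal.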

Claims (1)

claim

Single-layer analyses can be misleading because early-layer truth directions may reflect surface features with limited cross-task generalization.
supports
Methodological critique of prior work that fixed a single layer for truth probing.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. (claim · 0.800)
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers. (claim · 0.792)
Argues against the single-layer analysis approach of prior work.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6. (finding · 0.779)
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition. (finding · 0.766)
Geometric evidence for convergence to stable truth directions only for simpler tasks.
The depth-probe paper's central finding, scorer inversion, mirrors its own unpublished status recursively. (claim · 0.759)
Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. (claim · 0.755)
Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities. (claim · 0.752)
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers.
Universality claims for truth directions are more limited than previously assumed, with significant differences observable for various model layers, task difficulties, task types, and prompt templates. (claim · 0.751)
Overarching conclusion summarizing the paper's contribution relative to prior universality claims.