claim
active
claim:early-layer-truth-probes-primarily-capture-sentence-polarity-rather-than-truth
Early-layer truth probes primarily capture sentence polarity rather than truth.
Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
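A minimal toy sketch of what "inverting" means here, not the paper's method or data. It assumes synthetic activations in which a hypothetical early layer encodes truth multiplied by polarity along a single direction. Only the F0/F1 labels come from the entry above; the function names, dimensions, noise level, and the mean-difference probe are illustrative assumptions.

```python
import numpy as np

def make_activations(truth, polarity, direction, rng, noise=0.1):
    # Toy assumption: the layer encodes (truth * polarity), i.e. whether the
    # underlying affirmative proposition holds, not truth itself.
    signal = (truth * polarity)[:, None] * direction[None, :]
    return signal + noise * rng.standard_normal((truth.size, direction.size))

def fit_mean_difference_probe(x, truth):
    # Linear probe: weight is the difference of class means, and the
    # threshold sits at the midpoint between the two class means.
    mu_true = x[truth == 1].mean(axis=0)
    mu_false = x[truth == -1].mean(axis=0)
    w = mu_true - mu_false
    b = -0.5 * (mu_true + mu_false) @ w
    return w, b

def accuracy(x, truth, w, b):
    pred = np.where(x @ w + b > 0, 1, -1)
    return float((pred == truth).mean())

rng = np.random.default_rng(0)
dim, n = 16, 200  # hypothetical sizes
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)

# F0: affirmative statements (polarity +1); F1: negated statements (polarity -1).
truth_f0 = rng.choice([-1, 1], size=n)
truth_f1 = rng.choice([-1, 1], size=n)
x_f0 = make_activations(truth_f0, np.ones(n), direction, rng)
x_f1 = make_activations(truth_f1, -np.ones(n), direction, rng)

w, b = fit_mean_difference_probe(x_f0, truth_f0)
acc_f0 = accuracy(x_f0, truth_f0, w, b)
acc_f1 = accuracy(x_f1, truth_f1, w, b)
print(f"F0 accuracy: {acc_f0:.2f}, F1 accuracy: {acc_f1:.2f}")
```

Under this assumption the probe scores well on F0 but far below chance on F1: its predictions are systematically flipped, which is the signature of a direction that tracks polarity-entangled content rather than truth.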
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Findings (2)
finding
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Variance decomposition showing the disentanglement of polarity from truth across model depth.
Claims (1)
claim
- Methodological critique of prior work that fixed a single layer for truth probing.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
- Argues against the single-layer analysis approach of prior work.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
- Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. · claim · 0.755 · Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
- Hypothesized intermediate feature explaining the antipodal alignment between cities and neg_cities in early-to-middle layers.
- Overarching conclusion summarizing the paper's contribution relative to prior universality claims.