finding
active
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
Demonstrates that early-layer probes capture sentence polarity rather than truth.
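As a rough illustration of why an AUROC near 0 signals inverted rather than absent separation, the sketch below computes AUROC on a handful of made-up probe scores. The scores, labels, and the `auroc` helper are hypothetical and are not taken from the paper. The only assumption is that the probe assigns lower scores to true statements than to false ones on the negated set.

```python
def auroc(scores, labels):
    """Probability that a random true statement outscores a random false one.

    Ties count as half a win. Hypothetical helper for illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up probe scores on negated statements (1 = true, 0 = false).
# Every true statement scores below every false one.
labels = [1, 1, 1, 0, 0, 0]
scores = [-2.1, -1.4, -0.3, 0.4, 1.2, 2.5]

inverted = auroc(scores, labels)               # probe as trained
flipped = auroc([-s for s in scores], labels)  # same probe, sign reversed

print(f"AUROC as trained: {inverted:.2f}")
print(f"AUROC with sign flipped: {flipped:.2f}")
```

An AUROC of 0.5 would mean the probe carries no usable signal. A value near 0 means the ranking is almost perfectly reversed, so the direction still separates the two classes but with the opposite sign. That is consistent with a probe that tracks polarity rather than truth.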
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Papers (1)
paper
- Testing the Limits of Truth Directions in LLMs · associated_with
Claims (1)
claim
- Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
Concepts (1)
concept
- Truth direction universality · contradicts · The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Demonstrates the sharp drop in factual truth generalization at the counting boundary.
- Generalization evidence that truth probes are not invariant to model instructions.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Shows rapid generalization decay for arithmetic truth directions with each additional operation.
- Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
- Core empirical finding about layer-dependent truth direction emergence across task types.