finding

active

finding:f0-trained-probes-in-layers-4-10-show-inverted-separation-on-f1-auroc-0-systematically-misclassifying-true-statements-as-false

F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.

Demonstrates that early-layer probes capture sentence polarity rather than truth.
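
To make the "inverted separation" reading concrete, here is a minimal sketch with made-up probe scores (not data from the paper). It computes AUROC as the fraction of (true, false) statement pairs that the probe ranks in the right order. The names `auroc`, `scores_f0`, and `scores_f1` are hypothetical; F1 is taken to be the negated-statement set, as the linked claim describes.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probe outputs: label 1 = true statement, 0 = false statement.
labels = [1, 1, 1, 0, 0, 0]

# In-distribution (F0-like): true statements score higher than false ones.
scores_f0 = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# Negated set (F1-like): the same direction now ranks true below false.
scores_f1 = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]

auroc_f0 = auroc(scores_f0, labels)
auroc_f1 = auroc(scores_f1, labels)

# Negating the scores turns an AUROC of a into 1 - a, so an AUROC near 0
# means the direction is highly informative but has the wrong sign.
auroc_f1_flipped = auroc([-s for s in scores_f1], labels)

print(auroc_f0, auroc_f1, auroc_f1_flipped)
```

An AUROC near 0.5 would mean the probe carries no signal on F1. An AUROC near 0 means it carries a strong signal that tracks something anti-correlated with truth on negated statements, which is consistent with the polarity interpretation.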

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Papers (1)

paper

Testing the Limits of Truth Directions in LLMs
associated_with

Claims (1)

claim

Early-layer truth probes primarily capture sentence polarity rather than truth.
supports
Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).

Concepts (1)

concept

Truth direction universality
contradicts
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition.
finding · 0.830
Geometric evidence for convergence to stable truth directions only for simpler tasks.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6.
finding · 0.819
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
F3-trained probes achieve AUROC ~0.6 on F4, showing generalization breakdown from counting over 2 to 5 cities.
finding · 0.818
Demonstrates the sharp drop in factual truth generalization at the counting boundary.
No-prompt probes show a significant AUROC drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt.
finding · 0.810
Generalization evidence that truth probes are not invariant to model instructions.
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.
finding · 0.809
Key improvement in cross-task generalization enabled by explicit instruction framing.
Probes trained on A1 degrade significantly when evaluated on A2 and further on A3; probes trained on A2 reach only AUROC ~0.65 on A3.
finding · 0.805
Shows rapid generalization decay for arithmetic truth directions with each additional operation.
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes.
claim · 0.800
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 also emerge late, similarly to arithmetic.
finding · 0.789
Core empirical finding about layer-dependent truth direction emergence across task types.
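
Several of the neighbors above rest on cosine similarity between layer-wise probe directions. As a rough sketch of how such a similarity matrix could be built, the snippet below uses small made-up weight vectors. `probe_dirs` and its values are hypothetical and are not the paper's probes or code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical probe weight vectors keyed by layer index (toy 3-d values).
probe_dirs = {
    4: [1.0, 0.0, 0.0],   # early layer: a different direction
    16: [0.0, 1.0, 0.2],  # middle layer
    24: [0.0, 1.0, 0.3],  # late layer: close to the middle-layer direction
}

layers = sorted(probe_dirs)

# Pairwise similarity matrix; rows and columns follow the order of `layers`.
sim = [[cosine(probe_dirs[a], probe_dirs[b]) for b in layers] for a in layers]

for layer, row in zip(layers, sim):
    print(layer, [round(x, 3) for x in row])
```

In this toy setup the early-layer direction is orthogonal to the later ones, while the middle and late directions are nearly parallel. That is the kind of block structure a layer-by-layer heatmap would show if directions stabilize after a transition.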