finding

active

finding:f3-trained-probes-achieve-auroc-0-6-on-f4-showing-generalization-breakdown-from-counting-over-2-to-5-cities

F3-trained probes achieve AUROC ~0.6 on F4, showing generalization breakdown from counting over 2 to 5 cities.

Demonstrates the sharp drop in factual truth generalization at the counting boundary.
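
To make the metric concrete, the following is a minimal sketch of cross-task probe evaluation, not the paper's actual pipeline. It assumes synthetic activations in place of real model activations, and it uses a difference-of-class-means direction as a simple stand-in for probe training. The helper names (`make_task`, `fit_direction`, `auroc`, `evaluate`) are hypothetical. An AUROC of 0.5 corresponds to chance, so a value near 0.6 means the source-task direction separates true from false statements only slightly better than random on the target task.

```python
import random


def make_task(n, truth_axis, separation=3.0, dim=4, seed=0):
    """Synthetic 'activations': the label shifts one axis, the rest is noise."""
    rng = random.Random(seed)
    acts, labels = [], []
    for i in range(n):
        label = i % 2  # alternate false (0) / true (1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[truth_axis] += separation if label == 1 else -separation
        acts.append(vec)
        labels.append(label)
    return acts, labels


def fit_direction(acts, labels):
    """Stand-in for probe training: mean(true) minus mean(false)."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return [
        sum(a[j] for a in pos) / len(pos) - sum(a[j] for a in neg) / len(neg)
        for j in range(dim)
    ]


def auroc(scores, labels):
    """Fraction of (true, false) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def evaluate(direction, acts, labels):
    """Project activations onto the probe direction and score with AUROC."""
    scores = [sum(w * x for w, x in zip(direction, a)) for a in acts]
    return auroc(scores, labels)


# Source task encodes truth on axis 0; the target task encodes it on axis 1.
train_acts, train_labels = make_task(200, truth_axis=0, seed=0)
source_acts, source_labels = make_task(200, truth_axis=0, seed=1)
target_acts, target_labels = make_task(200, truth_axis=1, seed=2)

direction = fit_direction(train_acts, train_labels)
source_auc = evaluate(direction, source_acts, source_labels)
target_auc = evaluate(direction, target_acts, target_labels)
print(f"in-task AUROC:    {source_auc:.3f}")
print(f"cross-task AUROC: {target_auc:.3f}")
```

In this toy setup the direction learned on the source task is close to uninformative on the target task because the two tasks encode truth along different axes. Whether that is the mechanism behind the F3 to F4 drop is not something this sketch can establish.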

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results.
supports
Central empirical conclusion of the paper about the fundamental limits of truth directions.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probes trained on A1 degrade significantly when evaluated on A2 and more on A3; training on A2 achieves only AUROC ~0.65 on A3.
finding · 0.834
Shows rapid generalization decay for arithmetic truth directions with each additional operation.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6.
finding · 0.825
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
finding · 0.818
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.
finding · 0.807
Key improvement in cross-task generalization enabled by explicit instruction framing.
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 also emerge late, similar to arithmetic.
finding · 0.785
Core empirical finding about layer-dependent truth direction emergence across task types.
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities, despite higher classification accuracy on sp_en_trans.
finding · 0.774
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally.
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition.
finding · 0.759
Geometric evidence for convergence to stable truth directions only for simpler tasks.
An MM probe trained on the likely dataset achieves an NIE of 0.70 (false→true) on LLaMA-2-13B, which is surprisingly strong but weaker than truth probes.
finding · 0.756
The likely-trained MM probe is a surprisingly effective causal baseline due to the correlation between truth and probability on sp_en_trans.