finding

active

Probes trained on A1 degrade significantly when evaluated on A2 and even more on A3; probes trained on A2 achieve only AUROC ~0.65 on A3.

Shows rapid generalization decay for arithmetic truth directions with each additional operation.
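As a rough illustration of what a cross-task AUROC measures, the sketch below fits a probe direction on one toy dataset and scores it on a second, shifted one. Everything in it is an assumption made for illustration: the difference-of-means probe, the helper names (`mean_diff_direction`, `project`, `auroc`), and the hand-written 2-D "activations" are hypothetical and are not the paper's probing method or data.

```python
# Toy sketch: train a linear probe on one task, evaluate AUROC on another.
# All data below is invented for illustration; it does not come from the paper.

def mean_diff_direction(acts, labels):
    """Assumed probe: difference between the mean true and mean false activation."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    mu_pos = [sum(a[i] for a in pos) / len(pos) for i in range(dim)]
    mu_neg = [sum(a[i] for a in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mu_pos, mu_neg)]

def project(acts, direction):
    """Score each activation by its dot product with the probe direction."""
    return [sum(x * w for x, w in zip(a, direction)) for a in acts]

def auroc(labels, scores):
    """Fraction of (true, false) pairs where the true item scores higher; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one true and one false example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical "training task": truth is encoded along the first axis.
train_acts = [[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]]
train_labels = [1, 1, 0, 0]

# Hypothetical "harder task": truth is encoded mostly along the second axis.
shift_acts = [[0.2, 1.0], [-0.4, 1.0], [0.1, -1.0], [-0.3, -1.0]]
shift_labels = [1, 1, 0, 0]

direction = mean_diff_direction(train_acts, train_labels)
in_dist = auroc(train_labels, project(train_acts, direction))
shifted = auroc(shift_labels, project(shift_acts, direction))
inverted = auroc(train_labels, [-s for s in project(train_acts, direction)])

print(in_dist, shifted, inverted)
```

On this toy data the probe separates its own training task perfectly but only partially separates the shifted task, which is the qualitative pattern the finding describes. An AUROC of 0.5 corresponds to chance, and a value near 0 means the probe ranks true and false statements in the reverse order.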

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results.
supports
Central empirical conclusion of the paper about the fundamental limits of truth directions.

Hypotheses (1)

hypothesis

We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks.
supports
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

F3-trained probes achieve AUROC ~0.6 on F4, showing generalization breakdown from counting over 2 to 5 cities. · finding · 0.834
Demonstrates the sharp drop in factual truth generalization at the counting boundary.
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent. · finding · 0.815
Key improvement in cross-task generalization enabled by explicit instruction framing.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. · finding · 0.805
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases. · finding · 0.792
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics.
No-prompt probes show significant AUROC performance drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt. · finding · 0.789
Generalization evidence that truth probes are not invariant to model instructions.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6. · finding · 0.784
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
Under ask-correct, arithmetic tasks A1-A2 show gradual AUROC increase peaking only in final layers, unlike the sharp transition under no-prompt. · finding · 0.775
Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth. · finding · 0.765
Shows that truth representations are not reducible to text probability representations.