finding

active

finding:truth-probes-fail-to-generalize-to-harder-factual-tasks-f3-f5-regardless-of-prompt-template-with-auroc-near-or-below-0-6

Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6.

Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
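
For readers less familiar with the metric, AUROC can be read as the probability that a probe scores a randomly chosen true statement above a randomly chosen false one. The sketch below illustrates that definition only; it is not the paper's evaluation code, and the function name `auroc` and the toy scores are hypothetical. A value of 1.0 means perfect separation, 0.5 is chance, and values near 0 mean the probe's direction is inverted.

```python
def auroc(scores, labels):
    """AUROC via its pairwise-ranking definition.

    scores: probe outputs (higher = more likely "true"); hypothetical values.
    labels: 1 for true statements, 0 for false statements.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]

# Probe separates the classes perfectly.
print(auroc([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels))  # 1.0

# Probe direction is inverted: true statements score lowest.
print(auroc([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], labels))  # 0.0

# Scores overlap heavily: 5 of 9 (true, false) pairs are ranked correctly.
print(round(auroc([0.9, 0.4, 0.3, 0.8, 0.5, 0.1], labels), 2))  # 0.56
```

On this scale, an AUROC near or below 0.6 means the probe ranks true above false only slightly more often than a coin flip would.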

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Truth directions fail to generalize to harder tasks (F3-F5) regardless of prompt template because activations remain highly entangled.
supports
Establishes task difficulty as a hard limit that instructions cannot overcome.

Concepts (1)

concept

Truth direction universality
contradicts
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.
finding · 0.841
Key improvement in cross-task generalization enabled by explicit instruction framing.

Pure factual-recall tasks F0-F2 show robust AUROC performance across all instruction template variations.
claim · 0.832
Contrasts with harder tasks that are sensitive to prompt variations.

F3-trained probes achieve AUROC ~0.6 on F4, showing generalization breakdown from counting over 2 to 5 cities.
finding · 0.825
Demonstrates the sharp drop in factual truth generalization at the counting boundary.

F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
finding · 0.819
Demonstrates that early-layer probes capture sentence polarity rather than truth.

No-prompt probes show a significant AUROC drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt.
finding · 0.810
Generalization evidence that truth probes are not invariant to model instructions.

Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 also emerge late, similar to arithmetic.
finding · 0.802
Core empirical finding about layer-dependent truth direction emergence across task types.

The ask-correct template delays truth direction emergence for F3 and reduces performance for F4-F5 compared to no-prompt.
finding · 0.796
Shows instruction effects extend to harder factual tasks.

Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories.
finding · 0.791
Out-of-domain generalization showing deception features track general representational honesty.