claim

active

claim:pure-factual-recall-tasks-f0-f2-show-robust-auroc-performance-across-all-instruction-template-variations

Pure factual-recall tasks F0-F2 show robust AUROC performance across all instruction template variations.

Contrasts with harder tasks that are sensitive to prompt variations.
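As a rough illustration of what "robust AUROC across template variations" could mean in practice, the sketch below computes AUROC for one probe's scores under several templates and reports the spread between the best and worst template. The helper names (`auroc`, `auroc_spread`) and all scores are hypothetical and made up for illustration; only the template names come from this page. This is not the paper's evaluation code.

```python
def auroc(scores, labels):
    """Probability that a random true statement outscores a random false one.

    Ties count as 0.5. labels use 1 for true statements and 0 for false ones.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_spread(scores_by_template, labels):
    """Return per-template AUROC and the gap between the best and worst template."""
    per_template = {
        name: auroc(scores, labels) for name, scores in scores_by_template.items()
    }
    spread = max(per_template.values()) - min(per_template.values())
    return per_template, spread


# Synthetic probe scores, invented for this example (not values from the paper).
labels = [1, 1, 0, 0]
scores_by_template = {
    "no-prompt": [0.9, 0.8, 0.2, 0.1],
    "ask-correct": [0.7, 0.6, 0.4, 0.3],
}
per_template, spread = auroc_spread(scores_by_template, labels)
print(per_template, spread)
```

Under this reading, a small spread with uniformly high AUROC corresponds to the robustness the claim describes, while a large spread would indicate template sensitivity.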

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Truth directions fail to generalize to harder tasks (F3-F5) regardless of prompt template because activations remain highly entangled.
associated_with
Establishes task difficulty as a hard limit that instructions cannot overcome.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
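The selection rule described above (cosine ≥ 0.65 and no typed edge) can be sketched as follows. This is a minimal sketch under stated assumptions: the page does not say how embeddings are produced or whether typed edges are directed, so the function `similar_without_edge`, the toy vectors, the entity ids, and the choice to treat edges as undirected are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """List (entity_id, similarity) pairs at or above the threshold that share
    no typed edge with the target, sorted by similarity, highest first.

    Assumption: typed edges are treated as undirected pairs of ids.
    """
    target_vec = embeddings[target_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        sim = cosine(target_vec, vec)
        if sim >= threshold:
            candidates.append((other_id, sim))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy embeddings and edges, invented for illustration.
embeddings = {"c": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "d": [1.0, 0.5]}
typed_edges = {("c", "d")}  # "d" is similar but already linked, so it is skipped
candidates = similar_without_edge("c", embeddings, typed_edges)
print(candidates)
```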

Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6.
finding · 0.832
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.

Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late, similar to arithmetic.
finding · 0.811
Core empirical finding about layer-dependent truth direction emergence across task types.

Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.
finding · 0.803
Key improvement in cross-task generalization enabled by explicit instruction framing.

Within-family factual generalization (F0-F2) is consistently strong across all models and prompt settings.
finding · 0.780
Establishes a reliable baseline for factual truth direction universality within simple factual recall.

F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
finding · 0.775
Demonstrates that early-layer probes capture sentence polarity rather than truth.

The ask-correct template delays truth direction emergence for F3 and reduces performance for F4-F5 compared to no-prompt.
finding · 0.770
Shows instruction effects extend to harder factual tasks.

No-prompt probes show a significant AUROC drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt.
finding · 0.769
Generalization evidence that truth probes are not invariant to model instructions.

The ask-arith prompt shows weaker generalization to factual tasks than other explicit prompts, suggesting that a specialized arithmetic prompt does not create a unified truth direction across task families.
claim · 0.761
From the cross-task generalization heatmaps in Appendix B.3.3.
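One entry above reports inverted separation (AUROC ≈ 0). An AUROC near 0 is not chance-level performance (that would be about 0.5); it means the probe orders the two classes consistently but with the sign reversed. The sketch below shows this on synthetic scores: negating the scores turns an AUROC of 0 into 1. The `auroc` helper and the numbers are hypothetical and only illustrate the metric; they are not taken from the paper.

```python
def auroc(scores, labels):
    """Fraction of (true, false) statement pairs where the true one scores higher.

    Ties count as 0.5. labels use 1 for true statements and 0 for false ones.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic scores where every true statement scores BELOW every false one.
labels = [1, 1, 0, 0]
inverted_scores = [-2.0, -1.0, 1.0, 2.0]

inverted = auroc(inverted_scores, labels)
flipped = auroc([-s for s in inverted_scores], labels)
print(inverted, flipped)
```

In general, negating the scores maps an AUROC of x to 1 - x, which is why a value near 0 signals a systematically reversed direction rather than an uninformative one.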