finding

active

Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.

Key improvement in cross-task generalization enabled by explicit instruction framing.
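As an illustration of what "cross-task generalization (AUROC)" measures here, the sketch below fits a probe direction on one task's activations and scores it on another task. Everything in it is an assumption made for illustration: the activations are synthetic Gaussians, the probe is a simple difference-of-means direction (not necessarily the probe used in the source paper), and the names `make_task`, `fit_mean_diff_probe`, and `truth_axis` are hypothetical. A shared truth axis stands in for the ask-correct-like case, and a different axis stands in for the no-prompt-like case.

```python
import random

def auroc(scores, labels):
    """Probability that a random true example outscores a random false one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_vector(rows):
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

def fit_mean_diff_probe(X, y):
    """Probe direction = mean(true activations) - mean(false activations)."""
    mu_true = mean_vector([x for x, t in zip(X, y) if t == 1])
    mu_false = mean_vector([x for x, t in zip(X, y) if t == 0])
    return [a - b for a, b in zip(mu_true, mu_false)]

def project(X, w):
    """Score each activation vector by its dot product with the probe direction."""
    return [sum(xi * wi for xi, wi in zip(x, w)) for x in X]

def make_task(truth_axis, n=200, dim=8, seed=0):
    """Synthetic 'activations' (assumption): Gaussian noise plus a +/-2 shift along truth_axis."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[truth_axis] += 2.0 if label else -2.0
        X.append(x)
        y.append(label)
    return X, y

# Train on a stand-in "arithmetic" task, then evaluate on two stand-in "factual" tasks.
X_arith, y_arith = make_task(truth_axis=0, seed=1)
X_fact_shared, y_fact_shared = make_task(truth_axis=0, seed=2)  # same truth axis
X_fact_other, y_fact_other = make_task(truth_axis=1, seed=3)    # different truth axis

w = fit_mean_diff_probe(X_arith, y_arith)
auroc_shared = auroc(project(X_fact_shared, w), y_fact_shared)
auroc_other = auroc(project(X_fact_other, w), y_fact_other)
print(f"shared axis: {auroc_shared:.2f}  different axis: {auroc_other:.2f}")
```

In this toy setup, transfer is near-perfect only when both tasks encode truth along the same direction, which is the kind of alignment the finding attributes to the ask-correct prompt.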

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Using the ask-correct prompt improves cross-task generalization of arithmetic probes to factual tasks F0-F2.
supports
Finding that explicit correctness framing partially aligns truth directions across task families.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity.
finding · 0.854
Shows the passive vs. active divide is more important than the specific wording of instructions.

No-prompt probes show significant AUROC performance drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt.
finding · 0.849
Generalization evidence that truth probes are not invariant to model instructions.

The ask-arith prompt shows weaker generalization to factual tasks compared to other explicit prompts, suggesting a specialized arithmetic prompt does not create a unified truth direction across task families.
claim · 0.844
From the cross-task generalization heatmaps in Appendix B.3.3.

Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6.
finding · 0.841
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.

Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late similar to arithmetic.
finding · 0.830
Core empirical finding about layer-dependent truth direction emergence across task types.

Under ask-correct, arithmetic tasks A1-A2 show gradual AUROC increase peaking only in final layers, unlike the sharp transition under no-prompt.
finding · 0.822
Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.

Probes trained on A1 degrade significantly when evaluated on A2 and more on A3; training on A2 achieves only AUROC ~0.65 on A3.
finding · 0.815
Shows rapid generalization decay for arithmetic truth directions with each additional operation.

F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
finding · 0.809
Demonstrates that early-layer probes capture sentence polarity rather than truth.
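
The "inverted separation (AUROC ≈ 0)" in the last entry can be read as follows: the probe still separates the two classes, but with the sign reversed. The toy example below is a sketch with made-up scores and labels, not data from the paper. It shows that scores which rank every true statement below every false one give an AUROC of 0, and that negating the scores gives 1.

```python
def auroc(scores, labels):
    """Probability that a random true example outscores a random false one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]          # 1 = true statement, 0 = false statement
scores = [0.1, 0.2, 0.8, 0.9]  # made-up probe scores: true statements score lowest

inverted = auroc(scores, labels)
flipped = auroc([-s for s in scores], labels)  # reverse the probe's sign
print(inverted, flipped)  # prints 0.0 1.0
```

An AUROC near 0 is therefore as informative as one near 1; only an AUROC near 0.5 indicates that the probe carries no signal about the label.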