finding

active

finding:under-ask-correct-arithmetic-tasks-a1-a2-show-gradual-auroc-increase-peaking-only-in-final-layers-unlike-the-sharp-transition-under-no-prompt

Under ask-correct, arithmetic tasks A1-A2 show a gradual AUROC increase that peaks only in the final layers, unlike the sharp transition seen under no-prompt.

Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
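A layer-wise AUROC curve of the kind this finding describes can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's pipeline: the difference-of-means probe, the helper names (`synthetic_layer`, `layerwise_auroc`), and the per-layer separation values are all assumptions chosen only to contrast a gradual profile with a sharp one.

```python
import random

def auroc(scores, labels):
    """Probability that a random true example outscores a random false one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_diff_direction(acts, labels):
    """Assumed probe: mean of true activations minus mean of false activations."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return [sum(a[i] for a in pos) / len(pos) - sum(a[i] for a in neg) / len(neg)
            for i in range(dim)]

def project(acts, direction):
    """Score each activation by its dot product with the probe direction."""
    return [sum(x * w for x, w in zip(a, direction)) for a in acts]

def synthetic_layer(rng, n, dim, separation):
    """Hypothetical stand-in for one layer: true examples are shifted along axis 0."""
    labels = [i % 2 for i in range(n)]
    acts = [[rng.gauss((separation * y) if j == 0 else 0.0, 1.0) for j in range(dim)]
            for y in labels]
    return acts, labels

def layerwise_auroc(separations, n_train=200, n_test=200, dim=8, seed=0):
    """Fit a probe per layer on a train split and report held-out AUROC per layer."""
    rng = random.Random(seed)
    results = []
    for sep in separations:
        train_x, train_y = synthetic_layer(rng, n_train, dim, sep)
        test_x, test_y = synthetic_layer(rng, n_test, dim, sep)
        direction = mean_diff_direction(train_x, train_y)
        results.append(auroc(project(test_x, direction), test_y))
    return results

# Illustrative separation profiles only; these are not values from the paper.
gradual = layerwise_auroc([0.0, 0.5, 1.0, 2.0, 4.0])  # slow rise, peak at the last layer
sharp = layerwise_auroc([0.0, 0.0, 4.0, 4.0, 4.0])    # abrupt transition mid-network
print("gradual:", [round(v, 2) for v in gradual])
print("sharp:  ", [round(v, 2) for v in sharp])
```

With real activations, `synthetic_layer` would be replaced by the hidden states extracted at each layer under the chosen prompt template, and the probe could be any linear classifier.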

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions.
supports
Motivating hypothesis for Section 5's investigation of prompt template effects.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 also emerge late, similar to arithmetic.
finding · 0.848
Core empirical finding about layer-dependent truth direction emergence across task types.
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.
finding · 0.822
Key improvement in cross-task generalization enabled by explicit instruction framing.
No-prompt probes show a significant drop in AUROC when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt.
finding · 0.815
Generalization evidence that truth probes are not invariant to model instructions.
The ask-arith prompt shows weaker generalization to factual tasks compared to other explicit prompts, suggesting a specialized arithmetic prompt does not create a unified truth direction across task families.
claim · 0.776
From the cross-task generalization heatmaps in Appendix B.3.3.
Probes trained on A1 degrade significantly when evaluated on A2 and further on A3; training on A2 achieves only AUROC ~0.65 on A3.
finding · 0.775
Shows rapid generalization decay for arithmetic truth directions with each additional operation.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
finding · 0.767
Demonstrates that early-layer probes capture sentence polarity rather than truth.
At layer 0 with α=5, the detection-adjusted logit difference is +3.19 and the control increase is +3.22, a difference of only 0.03 logits.
finding · 0.756
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy.
Pure factual-recall tasks F0-F2 show robust AUROC performance across all instruction template variations.
claim · 0.755
Contrasts with harder tasks that are sensitive to prompt variations.
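One of the related findings above reports AUROC ≈ 0 rather than ≈ 0.5. The toy example below illustrates why that signals inverted separation rather than absence of signal: a score that ranks every true statement below every false one has AUROC 0, and negating the score yields AUROC 1. The scores and labels are made-up illustrative values, not data from the paper.

```python
def auroc(scores, labels):
    """Probability that a random true example outscores a random false one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical probe scores: every true statement (label 1) scores below every false one.
labels = [1, 1, 1, 0, 0, 0]
polarity_scores = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]

inverted = auroc(polarity_scores, labels)               # perfectly wrong ranking
flipped = auroc([-s for s in polarity_scores], labels)  # same probe with its sign negated
print(inverted, flipped)
```

In general, negating a score maps AUROC to 1 minus its value, so a probe near 0 separates the classes as cleanly as one near 1, just along the opposite sign.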