finding
active
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.
Key improvement in cross-task generalization enabled by explicit instruction framing.
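The cross-task evaluation behind this finding can be illustrated with a minimal sketch: fit a linear probe on one task family, then score a different family and report AUROC. Everything below is an assumption made for illustration, not the source paper's method. The activations are synthetic Gaussian vectors rather than model hidden states, the probe is a plain logistic regression, and the names `make_task`, `train_probe`, and `auroc` are hypothetical. The "aligned" case mimics two task families sharing a truth direction; the "unaligned" case mimics families whose truth directions are orthogonal.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                      # gradient of the log loss w.r.t. the logit
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def make_task(rng, direction, n=1000, noise=0.5):
    """Synthetic stand-in for activations: the label shifts points along `direction`."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(scale=noise, size=(n, len(direction)))
    X += np.outer(2 * y - 1, direction)   # true -> +direction, false -> -direction
    return X, y

rng = np.random.default_rng(0)
dim = 16
shared = np.zeros(dim)
shared[0] = 1.0                           # hypothetical shared truth direction
other = np.zeros(dim)
other[1] = 1.0                            # hypothetical orthogonal truth direction

X_train, y_train = make_task(rng, shared)          # stands in for the training tasks
X_aligned, y_aligned = make_task(rng, shared)      # target task, same direction
X_unaligned, y_unaligned = make_task(rng, other)   # target task, different direction

w, b = train_probe(X_train, y_train)
auroc_aligned = auroc(X_aligned @ w + b, y_aligned)
auroc_unaligned = auroc(X_unaligned @ w + b, y_unaligned)
print(f"aligned target:   AUROC = {auroc_aligned:.3f}")
print(f"unaligned target: AUROC = {auroc_unaligned:.3f}")
```

In this toy setting, transfer succeeds only when the two task families encode truth along the same direction, which is one way to read the contrast between the ask-correct and no-prompt conditions.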
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Finding that explicit correctness framing partially aligns truth directions across task families.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows the passive vs. active divide is more important than the specific wording of instructions.
- Generalization evidence that truth probes are not invariant to model instructions.
- From the cross-task generalization heatmaps in Appendix B.3.3.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Core empirical finding about layer-dependent truth direction emergence across task types.
- Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
- Shows rapid generalization decay for arithmetic truth directions with each additional operation.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.