finding
active
finding:under-ask-correct-arithmetic-tasks-a1-a2-show-gradual-auroc-increase-peaking-only-in-final-layers-unlike-the-sharp-transition-under-no-prompt
Under ask-correct, arithmetic tasks A1-A2 show gradual AUROC increase peaking only in final layers, unlike the sharp transition under no-prompt.
Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks, pushing peak probe AUROC to the final layers.
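One way to make the "gradual" versus "sharp" contrast concrete is to compute a per-layer AUROC curve and report the first layer at which most of the peak gain over chance is reached. The sketch below is a minimal illustration under stated assumptions, not the paper's method: `auroc`, `onset_layer`, the 0.95 fraction, and both example curves are hypothetical and are not taken from the source.

```python
def auroc(scores, labels):
    """Rank-based AUROC: the probability that a random positive example
    receives a higher probe score than a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def onset_layer(auroc_by_layer, frac=0.95):
    """Index of the first layer whose AUROC gain over chance (0.5)
    reaches `frac` of the peak gain. `frac` is an assumed threshold."""
    peak = max(auroc_by_layer)
    threshold = 0.5 + frac * (peak - 0.5)
    return next(i for i, a in enumerate(auroc_by_layer) if a >= threshold)


# Hypothetical curves for illustration only; NOT values from the paper.
sharp = [0.50, 0.52, 0.55, 0.93, 0.95, 0.95, 0.96, 0.96]    # no-prompt-like shape
gradual = [0.50, 0.55, 0.61, 0.68, 0.75, 0.82, 0.90, 0.96]  # ask-correct-like shape

print(onset_layer(sharp), onset_layer(gradual))
```

Under this summary, a sharp transition shows an onset layer well before the last layer, while a gradual rise shows an onset at or near the final layer. In practice each curve entry would come from calling `auroc` on a probe's scores at that layer.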
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Motivating hypothesis for Section 5's investigation of prompt template effects.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core empirical finding about layer-dependent truth direction emergence across task types.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Generalization evidence that truth probes are not invariant to model instructions.
- From the cross-task generalization heatmaps in Appendix B.3.3.
- Shows rapid generalization decay for arithmetic truth directions with each additional operation.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Concrete numerical example showing that detection and control are nearly identical at peak apparent accuracy.
- Contrasts with harder tasks that are sensitive to prompt variations.