finding
active
finding:no-prompt-probes-show-significant-auroc-performance-drop-when-evaluated-on-ask-correct-activations-especially-at-layers-where-arithmetic-truth-directions-emerge-under-no-prompt
No-prompt probes show significant AUROC performance drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt.
Generalization evidence that truth probes are not invariant to model instructions.
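The cross-template evaluation behind this finding can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the "activations" are random Gaussians rather than real model activations, the truth signal is placed on a different axis for each template by construction, and the probe is a simple mean-difference direction, which may differ from the probe used in the source paper. The names `make_activations`, `fit_mean_difference_probe`, `score`, and `auroc` are hypothetical.

```python
import random


def make_activations(rng, n, dim, truth_axis, shift=3.0):
    """Synthetic stand-in for activations: unit Gaussian noise, with
    true statements shifted along a single axis (an assumption)."""
    xs, ys = [], []
    for i in range(n):
        label = i % 2  # 1 = true statement, 0 = false statement
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if label == 1:
            x[truth_axis] += shift
        xs.append(x)
        ys.append(label)
    return xs, ys


def fit_mean_difference_probe(xs, ys):
    """Probe direction = mean(true) - mean(false). Illustrative probe only."""
    dim = len(xs[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, y in zip(xs, ys):
        counts[y] += 1
        for j in range(dim):
            sums[y][j] += x[j]
    return [sums[1][j] / counts[1] - sums[0][j] / counts[0] for j in range(dim)]


def score(w, xs):
    """Project each activation onto the probe direction."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]


def auroc(scores, ys):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    pos = [s for s, y in zip(scores, ys) if y == 1]
    neg = [s for s, y in zip(scores, ys) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
DIM = 8
# "no-prompt" template: truth signal on axis 0 (by construction).
train_x, train_y = make_activations(rng, 400, DIM, truth_axis=0)
test_np_x, test_np_y = make_activations(rng, 400, DIM, truth_axis=0)
# "ask-correct" template: truth signal on axis 1 (by construction).
test_ac_x, test_ac_y = make_activations(rng, 400, DIM, truth_axis=1)

w = fit_mean_difference_probe(train_x, train_y)
auroc_no_prompt = auroc(score(w, test_np_x), test_np_y)
auroc_ask_correct = auroc(score(w, test_ac_x), test_ac_y)
print(f"no-prompt probe on no-prompt data:   AUROC = {auroc_no_prompt:.3f}")
print(f"no-prompt probe on ask-correct data: AUROC = {auroc_ask_correct:.3f}")
```

Because the two synthetic templates encode truth on different axes by design, the probe separates held-out no-prompt data well and falls toward chance on ask-correct data. The sketch shows the shape of the evaluation protocol only; it does not reproduce the paper's numbers or layers.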
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (1)
claim
- The model appears to encode truth differently under passive versus active truth evaluation prompts. (supports) Key finding from Section 5, based on low cosine similarity between no-prompt and ask-correct probes.
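The probe comparison mentioned in this claim rests on cosine similarity between two probe weight vectors. A minimal sketch follows, assuming each probe is available as a plain list of floats; the vectors below are made-up placeholders, not values from the paper, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder probe directions (illustrative values only).
probe_no_prompt = [1.0, 0.0, 0.0, 0.0]
probe_ask_correct = [0.0, 1.0, 0.0, 0.0]

sim_same = cosine_similarity(probe_no_prompt, probe_no_prompt)
sim_cross = cosine_similarity(probe_no_prompt, probe_ask_correct)
print(f"same probe:  {sim_same:.2f}")
print(f"cross probe: {sim_cross:.2f}")
```

A value near 1 means the two probes point in nearly the same direction, while a value near 0 means they are close to orthogonal, which is the sense in which the claim describes the two templates as encoding truth differently.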
Questions (1)
question
- Specific question motivating the cross-template generalization experiment in Section 5.2.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Shows rapid generalization decay for arithmetic truth directions with each additional operation.
- Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection.
- Shows the passive vs. active divide is more important than the specific wording of instructions.
- From the cross-task generalization heatmaps in Appendix B.3.3.