finding

active

finding:no-prompt-probes-show-significant-auroc-performance-drop-when-evaluated-on-ask-correct-activations-especially-at-layers-where-arithmetic-truth-directions-emerge-under-no-prompt

Probes trained on no-prompt activations show a significant drop in AUROC when evaluated on ask-correct activations, especially at the layers where arithmetic truth directions emerge under no-prompt.

Cross-template generalization evidence that truth probes are not invariant to the instructions given to the model.
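The cross-template evaluation described here can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the difference-of-means probe, the synthetic activations, and every name (`synthetic_acts`, `mean_diff_probe`, `truth_axis`) are assumptions made for the example. The synthetic data simply places the "truth" signal on a different axis for each template, so a probe fitted on one template has little to read on the other.

```python
import random

def synthetic_acts(rng, truth_axis, n=200, dim=8, sep=3.0):
    """Hypothetical activations: Gaussian noise plus a shift on one axis for true statements."""
    acts, labels = [], []
    for i in range(n):
        y = i % 2  # alternate false (0) and true (1) statements
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if y == 1:
            v[truth_axis] += sep
        acts.append(v)
        labels.append(y)
    return acts, labels

def mean_diff_probe(acts, labels):
    """Probe direction = mean of true activations minus mean of false activations."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return [sum(a[i] for a in pos) / len(pos) - sum(a[i] for a in neg) / len(neg)
            for i in range(dim)]

def project(acts, probe):
    """Score each activation by its dot product with the probe direction."""
    return [sum(w * x for w, x in zip(probe, a)) for a in acts]

def auroc(scores, labels):
    """AUROC as the fraction of (true, false) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)

# Fit the probe on "no-prompt" activations (truth signal on axis 0, by assumption).
train_acts, train_labels = synthetic_acts(rng, truth_axis=0)
probe = mean_diff_probe(train_acts, train_labels)

# Held-out "no-prompt" data versus "ask-correct" data (truth signal on axis 1, by assumption).
np_acts, np_labels = synthetic_acts(rng, truth_axis=0)
ac_acts, ac_labels = synthetic_acts(rng, truth_axis=1)

auroc_in = auroc(project(np_acts, probe), np_labels)
auroc_cross = auroc(project(ac_acts, probe), ac_labels)
print(f"no-prompt -> no-prompt AUROC:   {auroc_in:.3f}")
print(f"no-prompt -> ask-correct AUROC: {auroc_cross:.3f}")
```

In a real replication, the synthetic arrays would be replaced by per-layer hidden states collected under each prompt template, and the comparison would be repeated layer by layer.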

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

The model appears to encode truth differently under passive versus active truth evaluation prompts.
supports
Key finding from Section 5 based on low cosine similarity between no-prompt and ask-correct probes.
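The probe-alignment comparison mentioned here comes down to a cosine similarity between two probe weight vectors. A minimal sketch, assuming each probe is a plain list of weights; the vectors below are invented for illustration and are not values from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical probe directions for the two templates (made-up numbers).
w_no_prompt = [0.9, 0.1, -0.2, 0.0]
w_ask_correct = [0.1, 0.8, 0.1, -0.3]

sim_self = cosine_similarity(w_no_prompt, w_no_prompt)
sim_cross = cosine_similarity(w_no_prompt, w_ask_correct)
print(f"same template: {sim_self:.3f}, across templates: {sim_cross:.3f}")
```

A value near 1 would indicate that the two templates share a direction; a value near 0 would indicate nearly orthogonal directions.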

Questions (1)

question

Will the no-prompt truth directions generalize to ask-correct activations?
answered_by
Specific question motivating the cross-template generalization experiment in Section 5.2.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent. (finding · cosine 0.849)
Key improvement in cross-task generalization enabled by explicit instruction framing.
Under ask-correct, arithmetic tasks A1-A2 show gradual AUROC increase peaking only in final layers, unlike the sharp transition under no-prompt. (finding · cosine 0.815)
Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6. (finding · cosine 0.810)
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. (finding · cosine 0.810)
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Probes trained on A1 degrade significantly when evaluated on A2 and more on A3; training on A2 achieves only AUROC ~0.65 on A3. (finding · cosine 0.789)
Shows rapid generalization decay for arithmetic truth directions with each additional operation.
Activation probing detects final answer belief earlier in CoT than CoT monitor on both models, with especially pronounced gap on easy MMLU questions. (finding · cosine 0.781)
Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection.
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. (finding · cosine 0.776)
Shows the passive vs. active divide is more important than the specific wording of instructions.
The ask-arith prompt shows weaker generalization to factual tasks compared to other explicit prompts, suggesting a specialized arithmetic prompt does not create a unified truth direction across task families. (claim · cosine 0.771)
From the cross-task generalization heatmaps in Appendix B.3.3.