finding
active
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity.
This shows that the divide between passive and active prompting matters more than the specific wording of the instruction.
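As a rough illustration of the comparison behind this finding, the sketch below computes pairwise cosine similarity between probe direction vectors. The vectors are synthetic stand-ins, not the paper's probes: four "active" directions are built as noisy copies of one shared vector, and a "no-prompt" direction is drawn independently. The function name `cosine_similarity_matrix`, the dimension, and the noise scale are all assumptions made for the example.

```python
import numpy as np


def cosine_similarity_matrix(directions):
    """Pairwise cosine similarity between the rows of an (n_probes, d) array."""
    directions = np.asarray(directions, dtype=float)
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    unit = directions / norms  # normalize each probe direction to unit length
    return unit @ unit.T       # dot products of unit vectors are cosines


# Synthetic stand-ins for probe weight vectors (assumption: not real probes).
rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension
shared = rng.normal(size=d)

prompt_names = ["ask-correct", "ask-t/f", "ask-able", "ask-arith"]
# Active-prompt probes: small perturbations of one shared direction.
active = [shared + 0.1 * rng.normal(size=d) for _ in prompt_names]
# Passive ("no-prompt") probe: an unrelated random direction.
passive = rng.normal(size=d)

names = prompt_names + ["no-prompt"]
sims = cosine_similarity_matrix(np.stack(active + [passive]))

for name, row in zip(names, sims):
    print(f"{name:>12}: " + " ".join(f"{value:+.2f}" for value in row))
```

With real probes, each row of the input would be the learned weight vector of one linear probe, and a block of high values among the active prompts alongside low values against the no-prompt row would correspond to the pattern described here.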
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (2)
claim
- The model appears to encode truth differently under passive versus active truth evaluation prompts. · supports · Key finding from Section 5, based on low cosine similarity between no-prompt and ask-correct probes.
- Shows the key divide is passive vs. active framing, not the specific wording of instructions.
Hypotheses (1)
hypothesis
- Motivating hypothesis for Section 5's investigation of prompt template effects.
Concepts (1)
concept
- Truth direction universality · contradicts · The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Finding that explicit correctness framing partially aligns truth directions across task families.
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
- From the cross-task generalization heatmaps in Appendix B.3.3.
- Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy.
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics.
- Suggestive evidence for language-independent truth representation in LLMs
- Demonstrates that early-layer probes capture sentence polarity rather than truth.