finding

active

finding:probes-trained-under-different-explicit-instruction-prompts-ask-correct-ask-t-f-ask-able-ask-arith-are-highly-aligned-with-each-other-in-cosine-similarity

Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity.

Shows that the passive vs. active divide matters more than the specific wording of the instructions.
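As a minimal sketch of the comparison behind this finding, the snippet below computes pairwise cosine similarity between probe weight vectors. The function names and the toy vectors are hypothetical and chosen only for illustration; they are not values or code from the paper, and real probe directions would have the dimensionality of the model's hidden state.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two probe weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(probes):
    """Map every ordered pair of probe names to its cosine similarity."""
    names = list(probes)
    return {(a, b): cosine_similarity(probes[a], probes[b])
            for a in names for b in names}

# Hypothetical toy directions (NOT taken from the paper): the two
# explicit-instruction probes point in nearly the same direction,
# while the no-prompt probe is orthogonal to both.
toy_probes = {
    "no-prompt": [1.0, 0.0, 0.0],
    "ask-correct": [0.0, 1.0, 0.1],
    "ask-t/f": [0.0, 0.9, 0.2],
}

sims = pairwise_cosine(toy_probes)
for (a, b), value in sims.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {value:.3f}")
```

In this toy setup the two explicit-instruction probes score close to 1 against each other and 0 against the no-prompt probe, which is the qualitative pattern the finding describes.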

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (2)

claim

The model appears to encode truth differently under passive versus active truth evaluation prompts.
supports
Key finding from Section 5 based on low cosine similarity between no-prompt and ask-correct probes.
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording.
supports
Shows the key divide is passive vs. active framing, not the specific wording of instructions.

Hypotheses (1)

hypothesis

We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions.
associated_with
Motivating hypothesis for Section 5's investigation of prompt template effects.

Concepts (1)

concept

Truth direction universality
contradicts
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent. · finding · 0.854
Key improvement in cross-task generalization enabled by explicit instruction framing.
Using the ask-correct prompt improves cross-task generalization of arithmetic probes to factual tasks F0-F2. · claim · 0.845
Finding that explicit correctness framing partially aligns truth directions across task families.
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition. · finding · 0.810
Geometric evidence for convergence to stable truth directions only for simpler tasks.
The ask-arith prompt shows weaker generalization to factual tasks compared to other explicit prompts, suggesting a specialized arithmetic prompt does not create a unified truth direction across task families. · claim · 0.804
From the cross-task generalization heatmaps in Appendix B.3.3.
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations. · claim · 0.798
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy.
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases. · finding · 0.790
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics.
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions. · finding · 0.786
Suggestive evidence for language-independent truth representation in LLMs.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. · finding · 0.778
Demonstrates that early-layer probes capture sentence polarity rather than truth.
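
The "inverted separation" in the last entry above can be made concrete with a small sketch. AUROC is the probability that a randomly chosen true statement receives a higher probe score than a randomly chosen false one, so a value near 0 means the probe ranks the classes almost perfectly but with the sign reversed. The helper name, labels, and scores below are hypothetical illustrations, not data from the paper.

```python
def auroc(scores, labels):
    """Fraction of (true, false) pairs where the true statement scores higher.

    Ties count as half a win. Labels are 1 for true statements, 0 for false.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical probe scores: every true statement scores below every
# false statement, i.e. the direction separates the classes but is inverted.
labels = [1, 1, 1, 0, 0, 0]
scores = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]

inverted = auroc(scores, labels)
flipped = auroc([-s for s in scores], labels)
print(inverted, flipped)
```

Negating the scores turns an AUROC of 0 into 1, which is why an AUROC near 0 signals a direction that tracks something sign-reversed across tasks (here, sentence polarity) rather than a probe that has learned nothing.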