claim

active

claim:the-model-appears-to-encode-truth-differently-under-passive-versus-active-truth-evaluation-prompts

The model appears to encode truth differently under passive versus active truth evaluation prompts.

Key finding from Section 5, based on the low cosine similarity between no-prompt and ask-correct probes.
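As a rough illustration of the comparison behind this claim, the sketch below computes the cosine similarity between probe weight vectors. The vectors and the names `w_no_prompt`, `w_ask_correct`, and `w_ask_tf` are hypothetical toy values chosen for illustration, not numbers or code from the paper; real probes would have one weight per hidden dimension.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy probe directions (assumed for illustration only).
w_no_prompt = [1.0, 0.0, 0.0]    # probe trained without any instruction
w_ask_correct = [0.0, 1.0, 0.0]  # probe trained under an explicit instruction
w_ask_tf = [0.0, 2.0, 0.0]       # probe trained under a second instruction

# Passive vs. active: orthogonal directions give a similarity of 0.
passive_vs_active = cosine_similarity(w_no_prompt, w_ask_correct)

# Active vs. active: parallel directions give a similarity of 1,
# regardless of the difference in vector length.
active_vs_active = cosine_similarity(w_ask_correct, w_ask_tf)
```

In this toy setup a low value for `passive_vs_active` alongside a high value for `active_vs_active` mirrors the pattern the claim describes, though the actual magnitudes in the paper may differ.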

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Findings (2)

finding

Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity.
supports
Shows the passive vs. active divide is more important than the specific wording of instructions.
No-prompt probes show a significant drop in AUROC when evaluated on ask-correct activations, especially at the layers where arithmetic truth directions emerge under no-prompt.
supports
Generalization evidence that truth probes are not invariant to model instructions.
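The transfer test described in the second finding can be sketched as follows: score activations with a fixed linear probe, then compare AUROC on the prompt condition the probe was trained on against AUROC on a different prompt condition. Everything here is an assumption made for illustration: the two-dimensional activations, the probe `w_no_prompt`, and the helper names are hypothetical, and the AUROC is computed with a simple pairwise (Mann-Whitney) count rather than any particular library.

```python
def probe_scores(weights, activations):
    """Project each activation vector onto the probe direction."""
    return [sum(w * x for w, x in zip(weights, act)) for act in activations]


def auroc(scores, labels):
    """AUROC as the fraction of (true, false) pairs ranked correctly.

    Ties between a true and a false statement count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy activations: truth varies along axis 0 without a prompt
# and along axis 1 under an explicit instruction (assumed, not from the paper).
labels = [1, 1, 0, 0]
acts_no_prompt = [[2.0, 0.1], [1.5, -0.2], [-1.0, 0.3], [-2.0, 0.0]]
acts_ask_correct = [[0.1, 2.0], [-0.2, 1.5], [0.3, -1.0], [0.0, -2.0]]

# Hypothetical probe fitted on no-prompt activations only.
w_no_prompt = [1.0, 0.0]

auroc_in_condition = auroc(probe_scores(w_no_prompt, acts_no_prompt), labels)
auroc_transfer = auroc(probe_scores(w_no_prompt, acts_ask_correct), labels)
```

On this toy data the probe separates the classes perfectly in its own condition and falls to chance level or below after transfer, which is the qualitative shape of the AUROC drop reported in the finding.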

Claims (1)

claim

Universality claims for truth directions are more limited than previously assumed, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
supports
Overarching conclusion summarizing the paper's contribution relative to prior universality claims.

Questions (1)

question

Does instructing the model to assess correctness affect the geometry of truth directions?
answered_by
One of the three guiding research questions of the paper.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The relationship between representations of truth of input statements and of model outputs in conjunction with model performance has not been investigated. · question · 0.805
Future work direction identified in conclusion for enabling reliable truth assessment methods.
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.796
Motivating hypothesis for Section 5's investigation of prompt template effects.
Representational abstraction of truth may emerge more clearly with model scale. · claim · 0.794
Interpretation of weaker PCA separation and lower ASR in smaller models.
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis. · claim · 0.793
Interpretive synthesis of DIM and cone intervention successes.
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions. · hypothesis · 0.785
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful.
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions. · finding · 0.777
Suggestive evidence for language-independent truth representation in LLMs
Truth directions fail to generalize to harder tasks (F3-F5) regardless of prompt template because activations remain highly entangled. · claim · 0.773
Establishes task difficulty as a hard limit that instructions cannot overcome.
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. · claim · 0.767
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.