claim
active
The model appears to encode truth differently under passive versus active truth evaluation prompts.
Key finding from Section 5, based on the low cosine similarity between the no-prompt and ask-correct probe directions.
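The sketch below shows what a low cosine similarity between two probes means in practice. It assumes, purely for illustration, that each probe is a difference-in-means (DIM) direction: the mean activation on true statements minus the mean activation on false statements, collected under one prompt template. The helper names (`mean_vector`, `dim_direction`, `cosine`) and the 3-dimensional activations are hypothetical toy data, not the paper's method or results. A value near 1 would mean the two templates share a truth direction. A value near 0, as in this constructed example, means the directions are nearly orthogonal.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dim_direction(true_acts, false_acts):
    """Difference-in-means probe direction: mean(true) - mean(false)."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy activations for each prompt template (not real model data).
no_prompt_true = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
no_prompt_false = [[-1.0, 0.1, 0.0], [-0.8, 0.1, 0.1]]
ask_correct_true = [[0.1, 0.9, 0.2], [0.0, 1.1, 0.0]]
ask_correct_false = [[0.1, -0.9, 0.2], [0.0, -1.1, 0.0]]

# One probe direction per template.
probe_no_prompt = dim_direction(no_prompt_true, no_prompt_false)
probe_ask_correct = dim_direction(ask_correct_true, ask_correct_false)

sim = cosine(probe_no_prompt, probe_ask_correct)
print(f"cosine(no-prompt, ask-correct) = {sim:.3f}")
```

Here truth separates along the first axis under one template and along the second axis under the other, so the two probes are orthogonal even though each one separates true from false within its own template.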
Source paper
extracted_from · 2026 · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood (ranked by edge count)
Findings (2)
finding
- Shows that the passive vs. active divide matters more than the specific wording of the instructions.
- Generalization evidence that truth probes are not invariant to model instructions.
Claims (1)
claim
- Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
Questions (1)
question
- Does instructing the model to assess correctness affect the geometry of truth directions? (answered_by) One of the three guiding research questions of the paper.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge. Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Future work direction identified in conclusion for enabling reliable truth assessment methods.
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Interpretation of weaker PCA separation and lower ASR in smaller models
- Interpretive synthesis of DIM and cone intervention successes
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions (hypothesis · 0.785): Hypothesis explaining the negative correlation between reflection rate and accuracy without implying that reflection is harmful.
- Suggestive evidence for language-independent truth representation in LLMs
- Establishes task difficulty as a hard limit that instructions cannot overcome.
- Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
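The "Related by similarity" rule (cosine ≥ 0.65, no typed edge) can be read as a simple filter over neighboring entities. This is a minimal sketch, assuming each neighbor carries a precomputed cosine similarity to the current entity and a flag for whether a typed edge already links them. The `Neighbor` type, its fields, and the example IDs and scores are hypothetical, not the page's actual data model.

```python
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.65  # the "cosine ≥ 0.65" cutoff shown on this page

@dataclass
class Neighbor:
    entity_id: str
    cosine: float         # hypothetical precomputed similarity to the current entity
    has_typed_edge: bool  # True if a typed relation (e.g. answered_by) already links them

def similarity_candidates(neighbors, threshold=SIMILARITY_THRESHOLD):
    """Keep neighbors that are semantically close but not yet linked by a typed
    edge, sorted by descending similarity (candidate edges or duplicates)."""
    hits = [n for n in neighbors if n.cosine >= threshold and not n.has_typed_edge]
    return sorted(hits, key=lambda n: n.cosine, reverse=True)

# Hypothetical neighbors for illustration only.
neighbors = [
    Neighbor("finding-x", 0.72, False),
    Neighbor("claim-y", 0.91, True),       # excluded: already has a typed edge
    Neighbor("question-z", 0.60, False),   # excluded: below the threshold
    Neighbor("hypothesis-w", 0.68, False),
]

for n in similarity_candidates(neighbors):
    print(n.entity_id, n.cosine)
```

Filtering on the missing typed edge is what separates this list from the typed Findings, Claims, and Questions sections above: a high-similarity neighbor that is already linked is not a new candidate.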