finding
active
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6.
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
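For readers less familiar with the metric, an AUROC of 0.5 corresponds to chance-level ranking of true versus false statements, so a value near 0.6 means the probe barely separates the two classes. The sketch below is illustrative only: the `auroc` helper and the score lists are hypothetical and made up for this example, not taken from the source paper, and it assumes probe outputs are plain floats with binary labels (1 = true statement, 0 = false statement).

```python
def auroc(scores, labels):
    """Pairwise (Mann-Whitney) AUROC: the fraction of (true, false) statement
    pairs in which the true statement gets the higher probe score.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical probe scores, invented for illustration (not real results).
labels = [1, 1, 1, 0, 0, 0]
separable = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]    # probe ranks every true item above every false one
overlapping = [0.9, 0.2, 0.5, 0.8, 0.3, 0.4]  # probe scores are mixed across classes

print(auroc(separable, labels))    # 1.0
print(auroc(overlapping, labels))  # 5/9, roughly 0.56
```

In the second case only 5 of the 9 true/false pairs are ranked correctly, which is the regime the finding describes for F3-F5.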
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Establishes task difficulty as a hard limit that instructions cannot overcome.
Concepts (1)
concept
- Truth direction universality · contradicts · The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Contrasts with harder tasks that are sensitive to prompt variations.
- Demonstrates the sharp drop in factual truth generalization at the counting boundary.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Generalization evidence that truth probes are not invariant to model instructions.
- Core empirical finding about layer-dependent truth direction emergence across task types.
- Shows instruction effects extend to harder factual tasks.
- Out-of-domain generalization showing that deception features track general representational honesty.