finding
active
Probes trained on A1 degrade significantly when evaluated on A2 and degrade further on A3; probes trained on A2 reach only AUROC ~0.65 on A3.
Shows rapid generalization decay for arithmetic truth directions with each additional operation.
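The evaluation protocol behind this finding (fit a probe on one task, score it on every task, compare AUROC) can be sketched as follows. This is a toy illustration, not the paper's code: the activations are synthetic, the difference-of-means probe is a stand-in for whatever probe the authors used, and the assumption that the "truth axis" rotates between A1, A2, and A3 (with the hypothetical angles 0°, 50°, 80°) is made up purely to give the cross-task matrix something to show. All function names are hypothetical.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_diff_probe(X, y):
    """Probe direction = mean(true activations) - mean(false activations)."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mean_pos = [sum(r[i] for r in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(r[i] for r in neg) / len(neg) for i in range(dim)]
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def probe_scores(direction, X):
    """Project each activation vector onto the probe direction."""
    return [sum(d * v for d, v in zip(direction, x)) for x in X]


def make_split(angle_deg, n, rng, dim=8):
    """Synthetic activations: unit Gaussian noise plus a +/- shift along a task-specific axis.

    ASSUMPTION: the axis is rotated by angle_deg in the first two coordinates.
    This is a toy stand-in for task-dependent truth directions.
    """
    theta = math.radians(angle_deg)
    axis = [math.cos(theta), math.sin(theta)] + [0.0] * (dim - 2)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        sign = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(0.0, 1.0) + sign * a for a in axis])
        y.append(label)
    return X, y


rng = random.Random(0)
# Hypothetical rotation angles per task; NOT taken from the paper.
task_angles = {"A1": 0.0, "A2": 50.0, "A3": 80.0}
train = {name: make_split(angle, 200, rng) for name, angle in task_angles.items()}
test = {name: make_split(angle, 200, rng) for name, angle in task_angles.items()}

# results[(train_task, test_task)] = AUROC of the probe fit on train_task
results = {}
for train_name, (X_tr, y_tr) in train.items():
    direction = fit_mean_diff_probe(X_tr, y_tr)
    for test_name, (X_te, y_te) in test.items():
        results[(train_name, test_name)] = auroc(probe_scores(direction, X_te), y_te)

for (train_name, test_name), value in sorted(results.items()):
    print(f"train={train_name} test={test_name} AUROC={value:.2f}")
```

Reading the printed matrix row by row mirrors the finding's structure: the diagonal entries are in-distribution scores, and the off-diagonal entries show how far a probe transfers to tasks it was not fit on. Held-out test splits are used even on the diagonal so that in-distribution and cross-task numbers are comparable.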
Source paper
extracted_from (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
Hypotheses (1)
hypothesis
- Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates the sharp drop in factual truth generalization at the counting boundary.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics.
- Generalization evidence that truth probes are not invariant to model instructions.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
- Shows that truth representations are not reducible to text probability representations.