finding
active
F3-trained probes achieve AUROC ~0.6 on F4, showing generalization breakdown from counting over 2 to 5 cities.
Demonstrates the sharp drop in factual truth generalization at the counting boundary.
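As a rough illustration of how a cross-task number like this is typically computed, the sketch below fits a linear probe on one task and scores it on another, then reports AUROC for both. Everything in it is an assumption made for illustration: the activations are synthetic Gaussians, the helper names (`mass_mean_direction`, `auroc`, `synthetic_task`) are hypothetical, and a mass-mean (difference-of-means) probe stands in for whatever probe the paper used for this finding. It does not model F3 or F4 and is not meant to reproduce the ~0.6 figure.

```python
import random


def mass_mean_direction(acts, labels):
    """Probe direction: mean activation of true items minus mean of false items."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    mean_pos = [sum(r[i] for r in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(r[i] for r in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def project(direction, acts):
    """Score each activation vector by its dot product with the probe direction."""
    return [sum(w * v for w, v in zip(direction, a)) for a in acts]


def auroc(scores, labels):
    """Fraction of (true, false) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def synthetic_task(rng, n, truth_axis, dim=8, gap=2.0):
    """Toy data (assumption): truth shifts activations along a single axis."""
    acts, labels = [], []
    for i in range(n):
        y = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[truth_axis] += gap if y == 1 else -gap
        acts.append(vec)
        labels.append(y)
    return acts, labels


rng = random.Random(0)
train_x, train_y = synthetic_task(rng, 200, truth_axis=0)  # training task
same_x, same_y = synthetic_task(rng, 200, truth_axis=0)    # held-out, same task
shift_x, shift_y = synthetic_task(rng, 200, truth_axis=1)  # different task

direction = mass_mean_direction(train_x, train_y)
in_task_auroc = auroc(project(direction, same_x), same_y)
cross_task_auroc = auroc(project(direction, shift_x), shift_y)
print(f"in-task AUROC:    {in_task_auroc:.3f}")
print(f"cross-task AUROC: {cross_task_auroc:.3f}")
```

In this toy setup the second task encodes truth along a different axis, so the probe's cross-task AUROC falls toward chance (0.5) while its in-task AUROC stays high. An AUROC of about 0.6 sits close to that chance level, which is why the finding is read as a breakdown in generalization.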
Source paper
extracted_from (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows rapid generalization decay for arithmetic truth directions with each additional operation.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Core empirical finding about layer-dependent truth direction emergence across task types.
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally.
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
- Likely-trained MM probe is a surprisingly effective causal baseline due to the correlation between truth and probability on sp_en_trans.