finding
active
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases
This result justifies restricting probe-based vector derivation to h_b activations; the source attributes the gap to Yes/No semantics.
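The statistic in this finding, the share of cases in which a probe reaches perfect test accuracy, can be sketched as follows. This is a minimal illustration under stated assumptions, not the source paper's procedure: the entry does not say which probe type was used, so a difference-of-means linear probe stands in here, and the data are synthetic Gaussians rather than real h_b or h_s activations. The names `fit_mean_diff_probe`, `probe_accuracy`, `perfect_rate`, and `make_case` are hypothetical.

```python
import numpy as np


def fit_mean_diff_probe(X, y):
    """Fit a linear probe whose direction is the difference of class means.

    Assumption: stand-in probe type; the source entry does not specify one.
    """
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = -float(w @ (mu_pos + mu_neg)) / 2.0  # threshold at the midpoint
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples whose predicted label matches y."""
    pred = (X @ w + b > 0).astype(int)
    return float((pred == y).mean())


def perfect_rate(cases):
    """Share of cases whose probe classifies every test example correctly."""
    cases = list(cases)
    perfect = 0
    for X_tr, y_tr, X_te, y_te in cases:
        w, b = fit_mean_diff_probe(X_tr, y_tr)
        if probe_accuracy(w, b, X_te, y_te) == 1.0:
            perfect += 1
    return perfect / len(cases)


def make_case(rng, shift, n=200, dim=8):
    """Synthetic train/test split; classes differ by `shift` along axis 0."""
    def split():
        y = np.tile([0, 1], n // 2)        # balanced binary labels
        X = rng.normal(size=(n, dim))      # unit-variance noise
        X[:, 0] += shift * (2 * y - 1)     # move classes apart by 2 * shift
        return X, y

    X_tr, y_tr = split()
    X_te, y_te = split()
    return X_tr, y_tr, X_te, y_te


rng = np.random.default_rng(0)
separable = [make_case(rng, shift=10.0) for _ in range(5)]   # far-apart classes
overlapping = [make_case(rng, shift=0.0) for _ in range(5)]  # no class signal

print("perfect-accuracy rate, separable:", perfect_rate(separable))
print("perfect-accuracy rate, overlapping:", perfect_rate(overlapping))
```

The two synthetic sets only show how the rate behaves at the extremes; they say nothing about why the h_b and h_s representations differ.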
Source paper
extracted_from (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Neighborhood — ranked by edge-count
Claims (1)
claim
- Explanation for why probes on h_b achieve perfect accuracy while h_s probes do so in only 0.60% of cases
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
- Shows rapid generalization decay for arithmetic truth directions with each additional operation
- Shows the passive vs. active divide is more important than the specific wording of instructions
- Shows that truth representations are not reducible to text probability representations
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Shows the key divide is passive vs. active framing, not the specific wording of instructions
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans