claim
active
claim:probes-trained-under-different-explicit-instruction-variants-are-highly-aligned-with-each-other-despite-different-wording
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording.
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
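One way to make "highly aligned" concrete is to compare probe directions pairwise. The sketch below is illustrative only: it assumes alignment is measured as cosine similarity between probe weight vectors, which this entry does not confirm, and it uses synthetic vectors in place of real probes. The names `pairwise_cosine`, `active_probes`, and `passive_probe` are hypothetical, and the numbers it produces are not results from the paper.

```python
import numpy as np


def pairwise_cosine(probes: np.ndarray) -> np.ndarray:
    """Return the matrix of cosine similarities between the rows of `probes`."""
    norms = np.linalg.norm(probes, axis=1, keepdims=True)
    unit = probes / norms          # scale each probe direction to unit length
    return unit @ unit.T           # entry (i, j) is cos(probe_i, probe_j)


# Synthetic stand-ins for probe weight vectors (assumption: NOT real data).
rng = np.random.default_rng(0)
dim = 64
shared_direction = rng.normal(size=dim)

# Three "active instruction" variants: one shared direction plus small noise.
active_probes = np.stack(
    [shared_direction + 0.1 * rng.normal(size=dim) for _ in range(3)]
)
# One "passive" probe drawn independently of the shared direction.
passive_probe = rng.normal(size=dim)

probes = np.vstack([active_probes, passive_probe[None, :]])
sims = pairwise_cosine(probes)

print(np.round(sims, 2))
```

In this toy setup the three active variants have pairwise similarities close to 1, while the independently drawn passive probe sits far from them. That is the qualitative pattern the claim describes, reproduced here by construction rather than measured.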
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Findings (1)
finding
- Shows the passive vs. active divide is more important than the specific wording of instructions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Probe-based method bridges interpretability (probes/activations) with data-centric alignment work · claim · 0.785 · Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
- Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally