claim
active
A probe achieving high classification accuracy provides no guarantee that the model actually distinguishes those classes in downstream computations
Motivation for causal evaluation over purely behavioural probing accuracy
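The gap between decodability and causal use can be illustrated with a small toy sketch. Everything here is a hypothetical construction for illustration, not the source paper's experiment: the activations `h`, the readout `w_out`, and the least-squares probe are all assumptions. The class label is written into one coordinate of the activations, while the downstream readout only reads a different coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 8

# Toy activations (hypothetical): the class label is written into coordinate 1.
labels = rng.integers(0, 2, size=n)
signs = 2 * labels - 1                      # map {0, 1} to {-1, +1}
h = rng.normal(size=(n, d))
h[:, 1] += 3.0 * signs

# Toy downstream computation (hypothetical): reads coordinate 0 only.
w_out = np.zeros(d)
w_out[0] = 1.0

def downstream(acts):
    return acts @ w_out

# Linear probe fit by least squares on +/-1 targets, with a bias column.
X = np.hstack([h, np.ones((n, 1))])
w_probe, *_ = np.linalg.lstsq(X, signs.astype(float), rcond=None)
probe_acc = np.mean((X @ w_probe > 0) == (labels == 1))

# Intervention: move coordinate 1 to where the opposite class would put it.
h_swapped = h.copy()
h_swapped[:, 1] -= 6.0 * signs
X_swapped = np.hstack([h_swapped, np.ones((n, 1))])
probe_flip_rate = np.mean((X_swapped @ w_probe > 0) != (X @ w_probe > 0))
max_output_change = np.max(np.abs(downstream(h_swapped) - downstream(h)))

print(f"probe accuracy:         {probe_acc:.3f}")
print(f"probe flips after swap: {probe_flip_rate:.3f}")
print(f"max downstream change:  {max_output_change:.3g}")
```

In this construction the probe separates the classes almost perfectly and its predictions flip under the intervention, yet the downstream output does not move, because the readout never consults the coordinate that carries the class. A causal evaluation would measure the last quantity rather than the first.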
Source paper
extracted_from · (2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Opus 4.1 demonstrates highest introspective awareness on abstract nouns (justice, peace, betrayal) with nonzero awareness across all concept categories tested.
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.767 · Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful