claim

active

claim:a-probe-achieving-high-classification-accuracy-provides-no-guarantee-that-the-model-actually-distinguishes-those-classes-in-downstream-computations

A probe achieving high classification accuracy provides no guarantee that the model actually distinguishes those classes in downstream computations

Motivation for causal evaluation over purely behavioural probing accuracy
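
The gap between decodability and use can be illustrated with a toy setup. The sketch below is not the CausalGym procedure; it is a minimal illustration that assumes a hypothetical 2-d hidden state and two hypothetical linear readouts (`w_ignores_class`, `w_uses_class`) standing in for downstream computation. It fits a difference-in-means probe, then reflects each hidden state across the probe's decision boundary (a simple stand-in for an interchange-style intervention) and measures how much each readout changes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Toy hidden states (hypothetical): dim 0 encodes the class, dim 1 is unrelated noise.
labels = rng.integers(0, 2, size=n)
h = np.column_stack([
    (2 * labels - 1) + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Two hypothetical linear readouts standing in for "downstream computation".
w_ignores_class = np.array([0.0, 1.0])  # never reads dim 0
w_uses_class = np.array([1.0, 0.0])     # reads only dim 0

# Difference-in-means probe: direction between class means, threshold at their midpoint.
mu1 = h[labels == 1].mean(axis=0)
mu0 = h[labels == 0].mean(axis=0)
unit = (mu1 - mu0) / np.linalg.norm(mu1 - mu0)
midpoint = 0.5 * (mu1 + mu0)

def probe_predict(hidden):
    """Classify hidden states by which side of the midpoint they fall on."""
    return ((hidden - midpoint) @ unit > 0).astype(int)

probe_accuracy = (probe_predict(h) == labels).mean()

# Intervention: reflect each hidden state across the probe's decision boundary,
# so the probe reads the opposite class afterwards.
proj = (h - midpoint) @ unit
h_flipped = h - 2.0 * proj[:, None] * unit
flip_rate = (probe_predict(h_flipped) != probe_predict(h)).mean()

def mean_abs_effect(w):
    """Average absolute change in a linear readout caused by the intervention."""
    return np.abs(h_flipped @ w - h @ w).mean()

effect_ignoring = mean_abs_effect(w_ignores_class)
effect_using = mean_abs_effect(w_uses_class)

print(f"probe accuracy:                {probe_accuracy:.3f}")
print(f"probe predictions flipped:     {flip_rate:.3f}")
print(f"effect on class-blind readout: {effect_ignoring:.3f}")
print(f"effect on class-using readout: {effect_using:.3f}")
```

In this toy case the probe's accuracy is the same whichever readout is attached, so accuracy alone cannot tell the two readouts apart; only the intervention does.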

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.861
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Are high-accuracy probe representations also causally relevant for the task? · question · 0.832
Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.781
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Models more effective at recognizing abstract nouns than other concept types · finding · 0.777
Opus 4.1 demonstrates highest introspective awareness on abstract nouns (justice, peace, betrayal) with nonzero awareness across all concept categories tested.
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.773
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. · finding · 0.770
Demonstrates that early-layer probes capture sentence polarity rather than truth.
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.768
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.767
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful