finding

active

finding:probe-achieves-selectivity-of-4-20-on-pythia-410m-slightly-exceeding-das-selectivity-of-3-96

Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96

Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
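
The comparison behind this finding can be sketched in a few lines. This is only an illustration: it assumes selectivity is a method's score on the real task minus its score on a control task, which this record does not spell out, and the `selectivity` helper and its example inputs are hypothetical. Only the two pythia-410m values (4.20 and 3.96) come from the finding itself.

```python
def selectivity(task_score: float, control_score: float) -> float:
    """Assumed definition (not stated in this record):
    score on the real task minus score on the control task."""
    return task_score - control_score


# Hypothetical inputs, used only to show how the helper behaves.
example = selectivity(task_score=5.0, control_score=1.0)

# Selectivity values reported in this finding for pythia-410m.
reported = {"probe": 4.20, "DAS": 3.96}

# Gap between probe and DAS selectivity, rounded to the reported precision.
gap = round(reported["probe"] - reported["DAS"], 2)
best = max(reported, key=reported.get)

print(f"higher selectivity: {best}, gap = {gap:.2f}")
```

Under this reading, a small positive gap in favor of the probe is what the record means by "slightly exceeding".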

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods
supports
Author interpretation of selectivity results showing DAS advantage diminishes when controlling for expressivity

Questions (1)

question

How much of the causal effect found by DAS is due to its expressivity rather than any aspect of the representation being studied?
answered_by
Core methodological question motivating the introduction of selectivity and control tasks

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task · finding · 0.808
Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
pythia-14m achieves only 0.38 accuracy on npi_ever_subj-relc task · finding · 0.796
Baseline accuracy showing small models fail on harder NPI licensing tasks
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.771
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.763
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.759
Numerical result for pythia-410m
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B · finding · 0.757
Case Study II result showing DAS identifies fewer causally relevant positions than a probe
Impulsivity probe: peak Cohen's d=3.60 (layer 13), p=3.58×10⁻¹³ in LLaMA-3.2-3B · finding · 0.755
Strongest probe validation result; highest Cohen's d among the four concepts
F3-trained probes achieve AUROC ~0.6 on F4, showing generalization breakdown from counting over 2 to 5 cities · finding · 0.752
Demonstrates the sharp drop in factual truth generalization at the counting boundary.