finding

active

finding:das-achieves-overall-odds-ratio-of-10-24-on-pythia-410m-averaged-across-all-causalgym-tasks

DAS achieves an overall odds ratio of 10.24 on pythia-410m, averaged across all CausalGym tasks

Numerical result for pythia-410m
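The following is a minimal sketch of how a score like this could be computed. It assumes, without confirmation from this entry, that the metric is CausalGym's log odds ratio: the change in log odds of the source-input label over the base-input label when a feature is patched from the source run into the base run. It also assumes that scores are averaged over examples within each task and then over tasks. The function names and the probability-pair input format are hypothetical. The reported 10.24 may be on a log scale, and this sketch computes only the log form.

```python
import math
from statistics import mean


def log_odds_ratio(p_base_run, p_intervened_run):
    """Hypothetical helper: change in log odds of source label vs. base label.

    Each argument is a (p_source_label, p_base_label) pair of next-token
    probabilities. The first comes from the unmodified run on the base input,
    the second from the run with the source feature patched in.
    Assumes a CausalGym-style log odds ratio; not the paper's code.
    """
    ps_before, pb_before = p_base_run
    ps_after, pb_after = p_intervened_run
    before = math.log(ps_before) - math.log(pb_before)
    after = math.log(ps_after) - math.log(pb_after)
    return after - before


def overall_score(per_task_examples):
    """Hypothetical aggregation: mean within each task, then mean across tasks."""
    task_means = [
        mean(log_odds_ratio(base, patched) for base, patched in examples)
        for examples in per_task_examples.values()
    ]
    return mean(task_means)
```

For instance, a patch that raises the source label from probability 0.1 to 0.9 (and lowers the base label from 0.9 to 0.1) scores 2 ln 9 ≈ 4.39. Averaging per task first keeps tasks with many examples from dominating the overall number.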

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
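The neighbor filter can be sketched as follows. This sketch assumes that each entity has a precomputed embedding vector and that typed edges are stored as (source, target) pairs. The function names, the data layout, and the choice to ignore edge direction are hypothetical and are not taken from the system that generated this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(entity_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical filter: neighbors with cosine >= threshold and no typed edge.

    embeddings: dict mapping entity id -> embedding vector (list of floats)
    typed_edges: iterable of (source_id, target_id) pairs; direction is ignored
    Returns (other_id, score) pairs sorted from most to least similar.
    """
    query = embeddings[entity_id]
    # Collect every entity already connected to this one by a typed edge.
    linked = set()
    for src, dst in typed_edges:
        if src == entity_id:
            linked.add(dst)
        elif dst == entity_id:
            linked.add(src)
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == entity_id or other_id in linked:
            continue
        score = cosine(query, vec)
        if score >= threshold:
            hits.append((other_id, score))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Sorting by descending score mirrors the order of the list below, which runs from 0.833 down to 0.742.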

DAS consistently finds the most causally efficacious features across all pythia model sizes in CausalGym · finding · 0.833
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random features
DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym · finding · 0.790
Hyperparameter tuning result for DAS; it differs from prior work because of the smaller training set size
LDA barely outperforms random features across all pythia model sizes in CausalGym · finding · 0.783
Surprising negative result for LDA, given that it is a supervised method
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.760
Scaling result showing larger pythia models perform better on CausalGym linguistic tasks
Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96 · finding · 0.759
Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
DAS runs in 502 seconds for hierarchical equality vs. an estimated 6e8 seconds for exhaustive brute-force search · finding · 0.753
Unlike brute-force search, DAS runtime is invariant to the number of hypotheses tested.
MDS achieves a global win proportion of 89.5% on SJTs across 14 LLMs and four injection strides · finding · 0.751
MDS dominates in open-ended generation by the global win proportion metric (Table 2)
pythia-14m achieves only 0.38 accuracy on the npi_ever_subj-relc task · finding · 0.742
Baseline accuracy showing small models fail on harder NPI licensing tasks
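For scale, the DAS hierarchical-equality runtime entry above implies the following back-of-the-envelope comparison. This is simple arithmetic on the two reported figures, assuming 3.156 × 10^7 seconds per year.

```latex
\frac{6 \times 10^{8}\,\text{s}}{502\,\text{s}} \approx 1.2 \times 10^{6},
\qquad
\frac{6 \times 10^{8}\,\text{s}}{3.156 \times 10^{7}\,\text{s/yr}} \approx 19\ \text{years}.
```

In other words, the estimated brute-force search would take roughly a million times longer than DAS, or about two decades of compute.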