finding

active

finding:task-accuracy-on-causalgym-increases-consistently-with-model-scale-from-0-62-14m-to-0-89-6-9b

Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B)

Scaling result showing that larger pythia models achieve higher task accuracy on CausalGym linguistic tasks

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

The causal evaluation paradigm will continue to be useful for interpretability research regardless of which specific methods prevail
supports
Forward-looking assertion in conclusion about the lasting value of causal evaluation

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Model organism accuracy on BigCodeBench: 56.8% (pre-fine-tuned: 59.1%); steering toward deployment: 55.2%; steering toward evaluation: 43.1% · finding · 0.772
Quantifies performance cost of fine-tuning and steering; deployment steering has minimal accuracy cost.
DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.767
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.760
Numerical result for pythia-410m
CausalGym covers only linguistic tasks; benchmarking interpretability methods on non-linguistic behaviours remains open · question · 0.760
Identified limitation calling for broader task coverage in future work
DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym · finding · 0.759
Hyperparameter tuning result for DAS; different from prior work due to smaller training set size
CausalGym results may differ on models trained on different data or in different orders beyond the pythia series · question · 0.759
Identified limitation about generalizability across model training regimes
LDA barely outperforms random features across all pythia model sizes in CausalGym · finding · 0.755
Surprising negative result for LDA despite being a supervised method
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent · finding · 0.754
Key improvement in cross-task generalization enabled by explicit instruction framing.