finding
active
finding:task-accuracy-on-causalgym-increases-consistently-with-model-scale-from-0-62-14m-to-0-89-6-9b
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B)
Scaling result showing that larger pythia models perform better on CausalGym linguistic tasks
Source paper
extracted_from · (2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts
Neighborhood — ranked by edge-count
Claims (1)
claim
- Forward-looking assertion in conclusion about the lasting value of causal evaluation
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Quantifies performance cost of fine-tuning and steering; deployment steering has minimal accuracy cost.
- DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.767 · Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
- DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.760 · Numerical result for pythia-410m
- Identified limitation calling for broader task coverage in future work
- DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym · finding · 0.759 · Hyperparameter tuning result for DAS; differs from prior work because of the smaller training set size
- CausalGym results may differ on models trained on different data or in different orders beyond the pythia series · question · 0.759 · Identified limitation about generalizability across model training regimes
- Surprising negative result for LDA despite being a supervised method
- Key improvement in cross-task generalization enabled by explicit instruction framing.