finding

active

finding:das-consistently-finds-the-most-causally-efficacious-features-across-all-pythia-model-sizes-in-causalgym

DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym

Main benchmark result showing that DAS outperforms probing, diff-in-means, PCA, k-means, LDA, and random features

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods
supports
Authors' interpretation of the selectivity results, which show that DAS's advantage diminishes when controlling for expressivity

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LDA barely outperforms random features across all pythia model sizes in CausalGym · finding · 0.894
Surprising negative result for LDA despite being a supervised method
CausalGym results may differ on models trained on different data or in different orders beyond the pythia series · question · 0.837
Identified limitation about generalizability across model training regimes
DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.833
Numerical result for pythia-410m
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.789
Attributed to model anisotropy from saturation making hidden states harder to access
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B · finding · 0.785
Case Study II result showing DAS identifies fewer causally relevant positions than a probe
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.784
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.767
Scaling result showing larger pythia models perform better on CausalGym linguistic tasks
CausalGym only includes English data; comparable experiments with other languages might yield substantially different results · question · 0.764
Identified limitation/gap calling for cross-lingual extension of CausalGym