2024

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

TL;DR

CausalGym, a benchmark of 29 tasks adapted and templatically expanded from SyntaxGym's 33 test suites, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-means, LDA, PCA, k-means, and random baselines at causally influencing language model behavior via 1D distributed interchange intervention. The comparison covers the pythia model family (14M–6.9B parameters); on pythia-1b, DAS achieves an average log odds-ratio of 10.74, compared to 3.66 for probing and 3.17 for difference-in-means, with each method trained on 400 examples and evaluated on 100 examples per task. However, when selectivity is computed by subtracting performance on control tasks that require arbitrary token mappings (an adaptation of Hewitt and Liang's (2019) probing control paradigm), the gap between DAS and probing narrows substantially, suggesting that DAS's advantage partly reflects its expressivity rather than genuine causal alignment. Applying DAS to training checkpoints of pythia-1b on NPI licensing (npi_any_subj-relc) and filler-gap dependencies (filler_gap_subj) shows that the causal mechanism for both phenomena emerges in discrete stages, not gradually, with information traversing multiple intermediate token positions before reaching the output. The multi-position mechanisms first take shape around steps 2000–3000 of training, and the filler-gap mechanism gains a further position after step 10K. The paper argues that interpretability evaluation therefore requires causal interventional paradigms rather than behavioral or representational proxies alone, and that psycholinguistic LM research should move beyond surprisal comparisons toward mechanistic analysis.

What to take away

  1. DAS achieves an average log odds-ratio of 10.74 on pythia-1b across all 29 CausalGym tasks, compared to 3.66 for linear probing and 3.17 for difference-in-means, making it the most causally efficacious feature-finding method benchmarked.
  2. CausalGym is a 29-task benchmark derived by templatically expanding SyntaxGym's test suites (covering agreement, licensing, garden-path effects, gross syntactic state, and long-distance dependencies) so that hundreds of aligned minimal pairs can be generated for supervised training of interpretability methods.
  3. When selectivity (log odds-ratio on the original task minus log odds-ratio on a control task with arbitrary token labels) is used instead of raw log odds-ratio, the advantage of DAS over probing largely disappears: both score approximately 4.24 on selectivity for pythia-1b, indicating that DAS's raw superiority partly reflects its expressivity rather than genuine causal alignment.
  4. The NPI licensing mechanism in pythia-1b emerges in discrete stages: a causal effect first appears at step 1000, an abrupt reorganization occurs at step 2000 when the auxiliary verb becomes important at middle layers, and a further intermediate position at the complementiser 'that' is added at step 3000.
  5. The filler-gap dependency mechanism in pythia-1b takes longer to learn than NPI licensing, emerging in two stages: an initial mechanism including the filler position and final token at step 2000, followed by addition of the main verb after step 10K.
  6. For both NPI licensing and filler-gap dependencies, the final pythia-1b mechanism routes information through multiple intermediate token positions across layers (e.g., negation moves to the complementiser in early layers, then to the auxiliary, then to the main verb), indicating multi-step information movement rather than direct feature propagation.
  7. LDA, despite being a supervised method, barely outperforms random feature vectors in the CausalGym benchmarking, scoring 0.29 on pythia-1b versus 0.03 for random, while unsupervised PCA and k-means score around 2.07–2.13.
  8. An open question the paper raises is why L2 regularization increases both probe accuracy and probe selectivity (causal efficacy minus control-task efficacy), as observed in hyperparameter tuning experiments. This relationship between regularization and causal alignment is left unexplained.
  9. To enable fair comparison, each CausalGym method is trained on 400 examples per task (200 original plus 200 base-source-swapped pairs) and evaluated on a non-overlapping set of 100 examples, with DAS trained for one epoch using the Adam optimizer at learning rate 5×10⁻³ with a linear warmup-then-decay schedule.
  10. The pythia model series (14M to 6.9B parameters, all trained on identical data in identical order with available checkpoints) provides a controlled substrate for studying both scale effects and training dynamics, with average task accuracy rising from 0.62 at 14M to 0.89 at 6.9B parameters.
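
The log odds-ratio and selectivity figures quoted in the takeaways can be made concrete with a small sketch. It assumes the log odds-ratio for one base/source pair compares the odds of the base label before the intervention with the odds of the source label after it; that reading, the function names, and the toy probabilities are illustrative assumptions, not taken from the paper's code.

```python
import math

def log_odds_ratio(p_base, p_intervened, base_label, source_label):
    """Log odds-ratio for one (base, source) pair (assumed definition).

    p_base:       label -> probability on the unmodified base input
    p_intervened: label -> probability on the base input after intervention
    """
    before = p_base[base_label] / p_base[source_label]
    after = p_intervened[source_label] / p_intervened[base_label]
    return math.log(before) + math.log(after)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def selectivity(task_scores, control_scores):
    """Average efficacy on the real task minus average efficacy on the control task."""
    return mean(task_scores) - mean(control_scores)

# Toy numbers (hypothetical): the model prefers "is" on the base input.
p_base = {"is": 0.8, "are": 0.2}
p_flipped = {"is": 0.2, "are": 0.8}   # intervention fully flips the preference

flipped = log_odds_ratio(p_base, p_flipped, "is", "are")
unchanged = log_odds_ratio(p_base, p_base, "is", "are")  # no-op intervention
```

Under this definition an intervention that leaves the output distribution untouched scores zero, and a method that works equally well on the real task and on the arbitrary-label control task has zero selectivity.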

Peer brief — for seminar discussion

CausalGym converts SyntaxGym's targeted syntactic evaluation paradigm into a causal interpretability benchmark. It generates large numbers of aligned minimal pairs from 29 linguistic tasks (spanning subject-verb agreement, NPI licensing, filler-gap dependencies, garden-path effects, and gross syntactic state) and uses them to train and evaluate seven feature-finding methods on their ability to causally shift model behavior via 1D distributed interchange intervention (1D DII). The core method, Geiger et al.'s distributed alignment search (DAS), learns a one-dimensional direction in the residual stream such that replacing the base input's component along that direction with the component from a source input maximally increases the probability of the counterfactual output label. This is contrasted with linear probing, difference-in-means, LDA, PCA, k-means, and random baselines, all evaluated on the same log odds-ratio metric across the pythia family of models from 14M to 6.9B parameters. The load-bearing finding is that DAS achieves substantially higher raw causal efficacy than all other methods (an average log odds-ratio of 10.74 on pythia-1b versus 3.66 for probing), but once a selectivity correction is applied (subtracting performance on control tasks requiring arbitrary '_dog'/'_give' token mappings, adapted from Hewitt and Liang's 2019 probing control paradigm), the DAS advantage collapses considerably, with selectivity scores of approximately 4.24 for both DAS and probing at the pythia-1b scale. The paper also applies DAS to training checkpoints of pythia-1b to trace the learning dynamics of NPI licensing (npi_any_subj-relc) and filler-gap extraction (filler_gap_subj), finding that both mechanisms emerge discontinuously in two or three abrupt stages rather than gradually, and that both involve multi-step routing of information across token positions and layers before reaching the output.
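
A one-dimensional interchange intervention of the kind described above can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming the intervention swaps only the component along a unit-normalized direction and leaves everything orthogonal to it untouched; the helper names and toy vectors are hypothetical, and `diff_in_means` shows just one of the benchmarked ways to pick a direction (DAS would instead learn it by gradient descent).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def dii_1d(base, source, direction):
    """Replace base's component along `direction` with source's component."""
    a = normalize(direction)
    shift = dot(source, a) - dot(base, a)
    return [b + shift * ai for b, ai in zip(base, a)]

def diff_in_means(class_a, class_b):
    """Candidate direction: mean activation of class A minus mean of class B."""
    dim = len(class_a[0])
    mean_a = [sum(v[i] for v in class_a) / len(class_a) for i in range(dim)]
    mean_b = [sum(v[i] for v in class_b) / len(class_b) for i in range(dim)]
    return [ma - mb for ma, mb in zip(mean_a, mean_b)]

# Toy 2-d "activations" (hypothetical values).
base = [1.0, 2.0]
source = [5.0, -3.0]
patched = dii_1d(base, source, direction=[2.0, 0.0])

direction = diff_in_means([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]])
```

In the toy call the direction is the first axis, so the patched vector takes its first coordinate from the source and keeps the base's second coordinate. In a real experiment the patched vector would be written back into the model's residual stream at a chosen layer and token position before the forward pass continues.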
The implication the paper draws is that causal interventional evaluation, rather than behavioral surprisal comparisons or representational probing accuracy alone, is the appropriate standard for interpretability methods, and that computational psycholinguists studying LMs should adopt this paradigm to move beyond input-output characterizations toward mechanistic understanding. An alternative the benchmark could have used is activation patching, or related circuit-level techniques such as path patching and causal scrubbing, which would allow multi-component and circuit-level attributions rather than the single 1D subspace approach adopted here. The most contestable aspect is the claim that DAS's reduced selectivity advantage implies approximate parity with probing: the selectivity metric depends on the specific arbitrary-mapping control task chosen ('_dog'/'_give'), and a critical reader would push back on whether this control adequately captures the full range of DAS's expressive excess, particularly given the paper's own acknowledgment that DAS finds significant causal effect even on randomly initialized models, a result corroborating Wu et al. (2023). The benchmark is also restricted to English, to one-dimensional linear subspaces, and to a single model family trained on a fixed data order, leaving open whether the discrete-stage learning pattern and the DAS-vs-probe ordering generalize to other architectures or training regimes.

Frameworks (3)

  • CausalGym
    Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym
  • Linear Representation Hypothesis
    The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
  • Targeted syntactic evaluation
    Benchmarking paradigm using minimally-different grammatical sentence pairs to test LM linguistic competence
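
Targeted syntactic evaluation reduces to a simple comparison per minimal pair, which the sketch below illustrates. It assumes an item counts as correct when the model assigns a higher log-probability to the grammatical continuation than to the ungrammatical one; the function names and the log-probabilities are made up for illustration and do not come from any model.

```python
def pair_is_correct(logp_grammatical, logp_ungrammatical):
    """An item is correct if the grammatical continuation is preferred."""
    return logp_grammatical > logp_ungrammatical

def accuracy(pairs):
    """Fraction of minimal pairs on which the grammatical option wins."""
    pairs = list(pairs)
    return sum(pair_is_correct(g, u) for g, u in pairs) / len(pairs)

# Hypothetical (grammatical, ungrammatical) log-probabilities for four items.
toy_pairs = [(-1.0, -2.0), (-3.0, -2.5), (-0.5, -4.0), (-2.0, -1.0)]
toy_accuracy = accuracy(toy_pairs)
```

A behavioural score of this kind says whether the model prefers the right continuation, but nothing about which internal features produce that preference, which is the gap the causal metrics in CausalGym are meant to fill.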

Original abstract

Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M–6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler–gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.
