CausalGym: Benchmarking causal interpretability methods on linguistic tasks

By Aryaman Arora · Dan Jurafsky · Christopher Potts (Stanford University)

DOI: 10.48550/arxiv.2402.12560 · arXiv: 2402.12560 · OpenAlex: W4392019764

Causal abstraction · CausalGym · Filler-gap dependency · Linear Representation Hypothesis · Mechanistic Interpretability · Targeted syntactic evaluation · Negative polarity item licensing · Residual Stream

TL;DR

CausalGym, a benchmark of 29 tasks templatically expanded from SyntaxGym's 33 test suites, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-means, LDA, PCA, k-means, and random baselines in causally influencing language model behavior via 1D distributed interchange intervention. The benchmark covers the pythia model family (14M–6.9B parameters); on pythia-1b, DAS achieves an average log odds-ratio of 10.74 compared to 3.66 for probing and 3.17 for difference-in-means, with each method trained on 400 examples and evaluated on 100 examples per task. However, when selectivity is computed by subtracting performance on control tasks that require arbitrary token mappings (an adaptation of Hewitt and Liang's (2019) probing control paradigm), the gap between DAS and probing narrows substantially, indicating that DAS's advantage partly reflects its expressivity rather than genuine causal alignment. Applying DAS to training checkpoints of pythia-1b on NPI licensing (npi_any_subj-relc) and filler-gap dependencies (filler_gap_subj) shows that the causal mechanism for both phenomena emerges in discrete stages rather than gradually, with information traversing multiple intermediate token positions before reaching the output. The NPI mechanism takes its final form by around step 3000, while the filler-gap mechanism gains its last component only after step 10K. The paper argues that interpretability evaluation therefore requires causal interventional paradigms rather than behavioral or representational proxies alone, and that psycholinguistic LM research should move beyond surprisal comparisons toward mechanistic analysis.
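To make the two metrics in this summary concrete, here is a minimal Python sketch. It assumes the log odds-ratio multiplies the odds of the base label over the source (counterfactual) label before the intervention by the odds of the source label over the base label after it. That reading, the function names, and the probabilities are illustrative assumptions, not values from the paper. Selectivity follows the definition given in the summary: efficacy on the original task minus efficacy on the control task.

```python
import math

def log_odds_ratio(p_base, p_intervened, base_label, source_label):
    """Log odds-ratio of an intervention (assumed definition, see lead-in).

    p_base and p_intervened map each candidate label to the model's
    probability for it before and after the intervention.
    """
    odds_before = p_base[base_label] / p_base[source_label]
    odds_after = p_intervened[source_label] / p_intervened[base_label]
    return math.log(odds_before * odds_after)

def selectivity(odds_original_task, odds_control_task):
    """Causal efficacy on the real task minus efficacy on the control task."""
    return odds_original_task - odds_control_task

# Hypothetical probabilities for a single agreement example.
p_base = {" is": 0.90, " are": 0.10}
p_intervened = {" is": 0.20, " are": 0.80}
score = log_odds_ratio(p_base, p_intervened, " is", " are")

# An intervention that changes nothing should score zero.
null_score = log_odds_ratio(p_base, p_base, " is", " are")
```

Under this definition an intervention that leaves the output distribution unchanged scores 0, and larger values mean the intervention moved more probability toward the counterfactual label.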

What to take away

1. DAS achieves an average log odds-ratio of 10.74 on pythia-1b across all 29 CausalGym tasks, compared to 3.66 for linear probing and 3.17 for difference-in-means, making it the most causally efficacious feature-finding method benchmarked.
2. CausalGym is a 29-task benchmark derived by templatically expanding SyntaxGym's test suites—covering agreement, licensing, garden-path effects, gross syntactic state, and long-distance dependencies—so that hundreds of aligned minimal pairs can be generated for supervised training of interpretability methods.
3. When selectivity (odds-ratio on the original task minus odds-ratio on a control task with arbitrary token labels) is used instead of raw odds-ratio, the advantage of DAS over probing is substantially reduced: both methods score approximately 4.24 on selectivity for pythia-1b, indicating that DAS's raw superiority partly reflects its expressivity rather than genuine causal alignment.
4. The NPI licensing mechanism in pythia-1b emerges in discrete stages: a causal effect first appears at step 1000, an abrupt reorganization occurs at step 2000 when the auxiliary verb becomes important at middle layers, and a further intermediate position at the complementiser 'that' is added at step 3000.
5. The filler-gap dependency mechanism in pythia-1b takes longer to learn than NPI licensing, emerging in two stages: an initial mechanism including the filler position and final token at step 2000, followed by addition of the main verb after step 10K.
6. For both NPI licensing and filler-gap dependencies, the final pythia-1b mechanism routes information through multiple intermediate token positions across layers—e.g., negation moves to the complementiser in early layers, then to the auxiliary, then to the main verb—indicating multi-step information movement rather than direct feature propagation.
7. LDA, despite being a supervised method, barely outperforms random feature vectors in the CausalGym benchmarking, scoring 0.29 on pythia-1b versus 0.03 for random, while unsupervised PCA and k-means score around 2.07–2.13.
8. An open question the paper raises is why L2 regularization increases both probe accuracy and probe selectivity (causal efficacy minus control-task efficacy), as observed in hyperparameter tuning experiments—this relationship between regularization and causal alignment is left unexplained.
9. To enable fair comparison, each CausalGym method is trained on 400 examples per task (200 original plus 200 base-source-swapped pairs) and evaluated on a non-overlapping set of 100 examples, with DAS trained for one epoch using the Adam optimizer at learning rate 5×10⁻³ with a linear warmup-then-decay schedule.
10. The pythia model series (14M to 6.9B parameters, all trained on identical data in identical order with available checkpoints) provides a controlled substrate for studying both scale effects and training dynamics, with average task accuracy rising from 0.62 at 14M to 0.89 at 6.9B parameters.
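The warmup-then-decay schedule mentioned in takeaway 9 could look roughly like the sketch below. Only the peak learning rate (5×10⁻³), the single epoch, and the 400 training examples come from the summary. The warmup fraction, the batch size, and the function name are assumptions chosen for illustration.

```python
def linear_warmup_decay(step, total_steps, peak_lr=5e-3, warmup_frac=0.1):
    """Learning rate at optimizer step `step` (0-indexed).

    warmup_frac is an assumed value; the summary does not state it.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly up to the peak over the warmup phase.
        return peak_lr * ((step + 1) / warmup_steps)
    # Decay linearly from the peak toward zero at the end of training.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# One epoch over 400 examples; the batch size of 4 is hypothetical.
n_examples, batch_size = 400, 4
total_steps = n_examples // batch_size
schedule = [linear_warmup_decay(s, total_steps) for s in range(total_steps)]
```

With so few optimizer steps per task, the shape of the schedule covers the whole run, which may be part of why the tuned learning rate differs from prior work that used larger training sets.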

Peer brief — for seminar discussion

CausalGym converts SyntaxGym's targeted syntactic evaluation paradigm into a causal interpretability benchmark by generating large numbers of aligned minimal pairs from 29 linguistic tasks—spanning subject-verb agreement, NPI licensing, filler-gap dependencies, garden-path effects, and gross syntactic state—and using them to train and evaluate seven feature-finding methods on their ability to causally shift model behavior via 1D distributed interchange intervention (1D DII). The core instrument, borrowed from Geiger et al.'s distributed alignment search (DAS), learns a one-dimensional direction in the residual stream that, when used to replace the base model's representation with a transformed version derived from a source input, maximally increases the probability of the counterfactual output label. This is contrasted with linear probing, difference-in-means, LDA, PCA, k-means, and random baselines, all evaluated on the same log odds-ratio metric across the pythia family of models from 14M to 6.9B parameters. The load-bearing finding is that DAS achieves substantially higher raw causal efficacy than all other methods—an average log odds-ratio of 10.74 on pythia-1b versus 3.66 for probing—but once a selectivity correction is applied (subtracting performance on control tasks requiring arbitrary '_dog'/'_give' token mappings, adapted from Hewitt and Liang's 2019 probing control paradigm), the DAS advantage collapses considerably, with selectivity scores of approximately 4.24 for both DAS and probing at the pythia-1b scale. The paper also applies DAS to training checkpoints of pythia-1b to trace the learning dynamics of NPI licensing (npi_any_subj-relc) and filler-gap extraction (filler_gap_subj), finding that both mechanisms emerge discontinuously in two or three abrupt stages rather than gradually, and that both involve multi-step routing of information across token positions and layers before reaching the output. 
The implication the paper draws is that causal interventional evaluation—rather than behavioral surprisal comparisons or representational probing accuracy alone—is the appropriate standard for interpretability methods, and that computational psycholinguists studying LMs should adopt this paradigm to move beyond input-output characterizations toward mechanistic understanding. An alternative method the benchmark could have used is activation patching (causal scrubbing or path patching), which would allow multi-component and circuit-level attributions rather than the single 1D subspace approach adopted here. The most contestable aspect is the claim that DAS's reduced selectivity advantage vindicates approximate parity with probing: the selectivity metric depends on the specific arbitrary-mapping control task chosen ('_dog'/'_give'), and a critical reader would push back on whether this control adequately captures the full range of DAS's expressive excess—particularly given the paper's own acknowledgment that DAS finds significant causal effect even on randomly initialized models, a result corroborating Wu et al. (2023). The benchmark is also restricted to English, to one-dimensional linear subspaces, and to a single model family trained on a fixed data order, leaving open whether the discrete-stage learning pattern and the DAS-vs-probe ordering generalize to other architectures or training regimes.
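The 1D distributed interchange intervention discussed in this brief can be sketched in a few lines of plain Python. The sketch assumes the intervention leaves the base representation untouched except along one unit-norm direction, where the base projection is replaced by the source projection. The vectors and the hand-picked direction are hypothetical; DAS would learn the direction by gradient descent, and the other methods would supply it from a probe weight, a difference of class means, and so on.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def interchange_1d(base, source, direction):
    """Swap the base's component along `direction` for the source's.

    Computes base + (source.a - base.a) * a, with a = direction / |direction|.
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [x / norm for x in direction]
    shift = dot(source, unit) - dot(base, unit)
    return [b + shift * a for b, a in zip(base, unit)]

# Hypothetical 3-dimensional "residual stream" vectors.
base = [1.0, 2.0, 3.0]
source = [4.0, 0.0, -1.0]
direction = [0.0, 3.0, 4.0]  # normalised to [0.0, 0.6, 0.8] inside the function

patched = interchange_1d(base, source, direction)
```

After the intervention, the patched vector agrees with the source along the chosen direction and with the base everywhere orthogonal to it, which is what makes the comparison across feature-finding methods a question of which direction each method proposes.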

Frameworks (3)

CausalGym
Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym
Linear Representation Hypothesis
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to justify restricting interventions to a one-dimensional subspace (1D DII)
Targeted syntactic evaluation
Benchmarking paradigm using minimally-different grammatical sentence pairs to test LM linguistic competence
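As an illustration of how templated minimal pairs can feed a benchmark like this, the sketch below builds aligned base/source pairs for a toy subject-verb agreement task and then adds the base-source-swapped copies, giving the 200 + 200 = 400 training examples described in the takeaways. The vocabulary and template are hypothetical and are not CausalGym's actual templates.

```python
import random

# Hypothetical vocabulary; CausalGym's real templates are not reproduced here.
SUBJECTS = [("author", "authors"), ("pilot", "pilots"), ("teacher", "teachers")]
VERBS = {"singular": " is", "plural": " are"}

def make_minimal_pair(rng):
    """Return a (base, source) pair differing only in subject number."""
    singular, plural = rng.choice(SUBJECTS)
    base = {"prompt": f"The {singular}", "label": VERBS["singular"]}
    source = {"prompt": f"The {plural}", "label": VERBS["plural"]}
    return base, source

def make_dataset(n, seed=0):
    rng = random.Random(seed)
    return [make_minimal_pair(rng) for _ in range(n)]

original = make_dataset(200)
# Swapping roles lets every input serve as both base and source.
swapped = [(source, base) for base, source in original]
train = original + swapped
```

Because both members of a pair share the same template, the position that differs lines up across inputs, which is what allows an intervention to be applied at a matching token position.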

Findings (15)

NPI mechanism in pythia-1b moves negation feature through complementiser 'that', auxiliary verb, and main verb across layers before predicting NPI 'any'
Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
L2 regularisation with bias term delivers best probe performance; L2 regularisation increases probe selectivity
Hyperparameter tuning result for probes; consistent with Hewitt and Liang 2019 finding
Filler-gap dependency mechanism in pythia-1b emerges in two discrete stages (steps 2000 and 10K) not gradually
Training dynamics finding showing filler-gap takes longer to learn than NPI licensing
Filler-gap mechanism in pythia-1b crosses over several different positions before arriving at output position
Mechanistic finding from CausalGym case study showing complex multi-step movement for filler-gap
DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym
Hyperparameter tuning result for DAS; different from prior work due to smaller training set size
NPI licensing mechanism in pythia-1b emerges in discrete stages (steps 1000, 2000, 3000) not gradually
Training dynamics finding showing abrupt rather than gradual emergence of NPI mechanism
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B)
Scaling result showing larger pythia models perform better on CausalGym linguistic tasks
Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96
Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
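For readers unfamiliar with the simplest baseline in these findings, difference-in-means proposes as its feature direction the vector between the two class means of the activations. The sketch below shows that computation in plain Python; the activations are hypothetical, and normalising to unit length is an assumption made so the result can be used directly as an intervention direction.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_in_means(class_a, class_b):
    """Unit vector pointing from the mean of class_b to the mean of class_a."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

# Hypothetical 2-dimensional activations for two label classes.
plural_acts = [[1.0, 2.0], [3.0, 2.0]]
singular_acts = [[-1.0, 2.0], [-3.0, 2.0]]
direction = diff_in_means(plural_acts, singular_acts)
```

Unlike DAS, this direction is computed without ever looking at the model's outputs, which fits the claim listed below that access to outputs during training accounts for much of the DAS advantage.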

Claims (6)

For both NPI and filler-gap tasks, the model initially learns to move information directly from alternating token to output; intermediate steps are added later in training
Mechanistic interpretation of training dynamics in case studies
Given the linear representation hypothesis and binary linguistic features, 1D DII is sufficiently expressive for controlling model behaviour in CausalGym
Theoretical justification for the methodological choice of 1D DII throughout the benchmark
A probe achieving high classification accuracy provides no guarantee that the model actually distinguishes those classes in downstream computations
Motivation for causal evaluation over purely behavioural probing accuracy
The causal evaluation paradigm will continue to be useful for interpretability research regardless of which specific methods prevail
Forward-looking assertion in conclusion about the lasting value of causal evaluation
DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods
Author interpretation of selectivity results showing DAS advantage diminishes when controlling for expressivity
The mechanisms implementing NPI licensing and filler-gap dependencies are learned in discrete stages, not gradually
Main mechanistic finding from case studies; evidence from training checkpoint analysis of pythia-1b

Hypotheses (1)

Understanding how LMs learn linguistic behaviours may offer insights into fundamental properties of language
Forward-looking hypothesis linking LM mechanism analysis to linguistic theory

Questions (7)

How much of the causal effect found by DAS is due to its expressivity rather than any aspect of the representation being studied?
Core methodological question motivating the introduction of selectivity and control tasks
CausalGym only includes English data; comparable experiments with other languages might yield substantially different results
Identified limitation/gap calling for cross-lingual extension of CausalGym
Would comparable experiments with other languages yield substantially different results about causal mechanisms LMs learn?
Limitation question about generalizability of CausalGym findings beyond English
CausalGym covers only linguistic tasks; benchmarking interpretability methods on non-linguistic behaviours remains open
Identified limitation calling for broader task coverage in future work
CausalGym results may differ on models trained on different data or in different orders beyond the pythia series
Identified limitation about generalizability across model training regimes
Multi-dimensional linear and non-linear interpretability methods have not been benchmarked on CausalGym
Identified gap in benchmark coverage; only 1D linear methods are benchmarked
Why does L2 regularisation increase probe causal efficacy (selectivity)?
Open question identified in hyperparameter tuning experiments, left for future work

Original abstract

Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M–6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler–gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
cited
in corpus
2023
≈ 86%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
cited
in corpus
2023
≈ 80%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 85%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 85%
Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization
Mohamed Zayaan S
2025
≈ 83%
Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations
Ajay Pravin Mahale
2026
≈ 83%
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Usman Naseem
2026
≈ 83%
The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring
in corpus
2025
≈ 83%
Model Alignment Search
in corpus
2025
≈ 83%
Inference Time Causal Probing in LLMs
Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser Sadegh Khorasani
2026
≈ 83%
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
Yuzhang Luo, Liangming Pan Jianhui Chen
2026
≈ 82%
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek Dana Arad
2025
≈ 82%
Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics
Muhammad Zaeem Khan, Aleesha Zainab, Saleha Jamshed, Sadia Ahmad, Kaynat Khatib, Faria Bibi, and Abdul Rehman Asifullah Khan
2026
≈ 82%
From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
Xinwei Wu, Xiaohu Zhao, Hao Wang, Heng Liu, Yangyang Liu, Linlong Xu, Longyue Wang, Deyi Xiong, Weihua Luo Ling Shi
2026
≈ 82%
Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Zubair Bashir, Procheta Sen Bhavik Chandna
2025
≈ 82%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 82%
Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers
Henry Conklin, Yukang Yang, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie Andrew Nam
2025
≈ 82%
Alignment faking in large language models
in corpus
2024
≈ 82%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 82%
Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning
Ananya Gupta, Chengyu Wang, Chiamaka Adebayo, and Jakub Kowalski Aditya Sharma
2025
≈ 82%
Constructing Interpretable Features from Compositional Neuron Groups
Atticus Geiger, Mor Geva Or Shafran
2026
≈ 82%
Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models
Mehdi Taghipour, Rahmatollah Beheshti Ali Abbasi
2026
≈ 82%
Steering Conceptual Bias via Transformer Latent-Subspace Activation
Vansh Sharma and Venkat Raman
2025
≈ 82%
A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders
Rajiv Misra, Sanjay Kumar Singh, Anisha Roy Dip Roy
2026
≈ 81%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 81%
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
in corpus
2026
≈ 81%
A geometric notion of causal probing
cited
2023
≈ 79%
A Mathematical Framework for Transformer Circuits
cited
2021
≈ 73%
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
cited
2024
≈ 73%
Causal abstraction: A theoretical foundation for mechanistic interpretability
cited
2025
≈ 68%

Cited by (3)

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as