CausalGym

A multi-task benchmark of linguistic behaviours for measuring the causal efficacy of interpretability methods, adapted from SyntaxGym

Neighborhood — ranked by edge-count

paper

method

Distributed Alignment Search
uses
Finds alignments between high-level causal variables and distributed neural representations via gradient descent.
Principal components analysis (PCA)
uses
Statistical method used to analyze neural activity data.
1D Distributed Interchange Intervention (1D DII)
uses
Core intervention method used throughout CausalGym; operates on a one-dimensional, non-basis-aligned subspace of activation space
Difference-in-Means
uses
Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
Linear Probing
uses
Used to evaluate representation quality across VTAB tasks
k-means clustering
uses
Unsupervised feature-finding method using cluster centroid difference as feature direction
Linear Discriminant Analysis
uses
IID mass-mean probing coincides with LDA when covariance is known; used to derive the corrected probe formula
Selectivity
uses
Adapted control-task metric measuring the difference between the odds-ratio on the original task and on an arbitrary-label control task
Log odds-ratio
uses
Primary evaluation metric measuring the causal effect of interventions; a greater value indicates a larger causal effect
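
The intervention and metrics listed above can be illustrated with a minimal sketch. It assumes activations are plain Python lists of floats and that label probabilities are already available. The function names (diff_in_means, dii_1d, log_odds_ratio, selectivity) are hypothetical and are not taken from the CausalGym codebase. The log odds-ratio here compares the base-label and source-label probabilities before and after the intervention, which is an assumption about the exact form of the metric rather than a confirmed implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def diff_in_means(pos, neg):
    """Unit-norm direction: mean of `pos` activations minus mean of `neg`.

    Assumes the two group means differ, so the norm is non-zero.
    """
    dim = len(pos[0])
    mean_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def dii_1d(base, source, direction):
    """1D interchange: replace the base activation's component along the
    unit vector `direction` with the source activation's component."""
    shift = dot(source, direction) - dot(base, direction)
    return [b + shift * d for b, d in zip(base, direction)]


def log_odds_ratio(p_original, p_intervened):
    """Each argument is a pair (p(base label), p(source label)) on the base
    input, without and with the intervention (assumed form of the metric)."""
    (po_yb, po_ys), (pi_yb, pi_ys) = p_original, p_intervened
    return math.log(po_yb / po_ys) + math.log(pi_ys / pi_yb)


def selectivity(task_odds_ratio, control_odds_ratio):
    """Odds-ratio on the original task minus odds-ratio on the control task."""
    return task_odds_ratio - control_odds_ratio


# Toy usage with hypothetical 2-d activations.
direction = diff_in_means([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
patched = dii_1d([0.0, 5.0], [7.0, 9.0], direction)
effect = log_odds_ratio((0.8, 0.2), (0.2, 0.8))
```

In this toy case the intervention changes only the coordinate along the learned direction, and an intervention that fully flips the model's preference yields a positive log odds-ratio, while one that changes nothing yields zero.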

framework

Targeted syntactic evaluation
implements
Benchmarking paradigm using minimally-different grammatical sentence pairs to test LM linguistic competence
SyntaxGym
extends
Online platform for targeted evaluation of language models that CausalGym adapts

dataset

Pythia model series
uses
Suite of 10 language models from 14M to 12B parameters, trained on the same data in the same order; used for all experiments

artifact

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CausalGym only includes English data; comparable experiments with other languages might yield substantially different results · question · 0.792
Identified limitation/gap calling for cross-lingual extension of CausalGym
CausalGym covers only linguistic tasks; benchmarking interpretability methods on non-linguistic behaviours remains open · question · 0.790
Identified limitation calling for broader task coverage in future work
Causal Geometry · framework · 0.772
Chvykov and Hoel's geometric extension of causal emergence to continuous systems using Fisher information.
Causal Mediation · concept · 0.761
Whether an internal direction causally controls a target behavior, verified by intervention success
Acyclic Causal Model · concept · 0.753
Consists of input, intermediate, and output variables with associated causal mechanisms; the mathematical object central to DAS.
Deterministic Causal Model · concept · 0.742
Formal representation of algorithms as directed acyclic graphs computing functions f_A
Causal graph formalism (Wolfram physics style) · framework · 0.741
Janus proposes transformer computation viewed as causal graph with foliations/time-slices specifying computation order.
Causal Mechanism · concept · 0.738
Function determining the value of a variable based on its causal parents in an acyclic causal model.