Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

By Atticus Geiger · Zhengxuan Wu · Christopher Potts · Thomas Icard · Noah D. Goodman
Goodfire, Harvard University + 2 more

DOI: 10.48550/arxiv.2303.02536 · arXiv: 2303.02536 · OpenAlex: W4323557327

Distributed Neural Representations · Parallel Distributed Processing Framework · Distributed Alignment Search · Hierarchical Equality Training Data · Explainable Artificial Intelligence · Distributed Interchange Intervention · MoNLI Dataset · Faithfulness of Explanations · Interchange Intervention Accuracy · MultiNLI Dataset · Hierarchical Equality Task · Subspace DAS · Monotonicity Natural Language Inference

TL;DR

Distributed alignment search (DAS) resolves two blocking limitations of prior causal abstraction work—brute-force alignment search and the localist assumption that high-level variables map to disjoint neuron sets—by using gradient descent over orthogonal rotation matrices to find alignments in non-standard bases of neural representations. On a hierarchical equality task, a three-layer feed-forward network with hidden size 16 achieves 100% interchange intervention accuracy (IIA) under DAS at layer 1 with an 8-dimensional intervention subspace, whereas the best brute-force localist search reaches only 0.60 IIA and the closest localist alignment only 0.73 IIA. On the Monotonicity NLI benchmark, BERT-base fine-tuned on MoNLI achieves 100% IIA at layer 9 when 256 non-standard basis dimensions of the [CLS] token encode lexical entailment and 256 others encode negation, while no localist alignment exceeds 0.51 IIA on the same task. A subsequent subspace decomposition reveals a structural asymmetry: the hierarchical equality representations of w=x and y=z cannot be decomposed into representations of individual input identities (subspace DAS IIA ≈ 0.50–0.51), whereas the apparent lexical-entailment representation in BERT decomposes almost perfectly (IIA ≈ 0.97–0.98) into two word-identity representations. DAS implies that previous negative or weak causal abstraction findings may have been artifacts of the localist assumption, and that neural networks can genuinely implement tree-structured symbolic algorithms—but that apparent relational representations may sometimes be data structures over entity identities rather than true relational encodings.

What to take away

1. DAS (distributed alignment search) finds alignments between high-level causal variables and distributed neural representations by optimizing an orthogonal rotation matrix with stochastic gradient descent rather than brute-force search over localist neuron subsets.
2. On the hierarchical equality task, DAS achieves 100% IIA at layer 1 of a 16-hidden-unit feed-forward network using an 8-dimensional intervention subspace, compared to 0.60 IIA for brute-force localist search and 0.73 IIA for the closest localist alignment.
3. For BERT-base fine-tuned on the MoNLI benchmark, DAS finds 100% IIA at layer 9 with a 256-dimensional intervention subspace for the joint negation-and-lexical-entailment high-level model, while all localist alignments remain at or below 0.51 IIA.
4. The learned rotation matrices are non-trivial: eigenvector rotation analyses show the majority of basis vectors are substantially rotated, indicating that high-level causal structure is genuinely distributed and not recoverable by standard neuron-aligned probes.
5. Subspace DAS applied to the hierarchical equality task finds that representations of w=x and y=z cannot be decomposed into representations of individual input identities (IIA ≈ 0.50–0.51), establishing that the network encodes abstract relational structure independent of the participating entities.
6. Subspace DAS applied to the MoNLI BERT model finds that the apparent lexical-entailment representation decomposes nearly perfectly into two word-identity representations (IIA ≈ 0.97–0.98 at layer 9), revealing it is a data structure over word identities rather than a true relational encoding.
7. DAS runtime for the MoNLI task is approximately 1,105 seconds, versus a tractable brute-force runtime of 198 seconds over a limited hypothesis set, but the brute-force worst-case combinatorial space is estimated at C(768,32) ≈ 2e58 hypotheses, making exhaustive search computationally infeasible.
8. To replicate DAS, one implements a differentiable orthogonal matrix parameterization (e.g., PyTorch's torch.nn.utils.parametrizations.orthogonal), freezes both low-level and high-level models, and minimizes cross-entropy between the high-level output distribution and the push-forward of the low-level output distribution under distributed interchange interventions.
9. Applying DAS to randomly initialized, chance-accuracy (50%) networks shows that IIA rises above chance only when the hidden dimension is orders of magnitude larger than the input dimension (e.g., 0.50 IIA at hidden size 16 but 0.64 IIA at hidden size 4096 for a 16-dimensional input), suggesting that DAS cannot fabricate causal structure in appropriately sized networks, although very large random networks offer enough random structure for it to exploit.
10. An open question the paper raises is whether non-linear invertible transformations (e.g., normalizing flows) rather than orthogonal matrices would be required to find alignments when high-level variables are encoded in non-linear sub-manifolds of the representation space, which DAS in its current form cannot handle.
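
The gradient-based search described in takeaways 1 and 8 can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration, not the paper's setup: the "network" is a hand-built 2-dimensional toy whose hidden state mixes two variables u and v at a hidden angle `PHI`, a single rotation angle stands in for the orthogonal matrix parameterization, squared error stands in for cross-entropy, and a finite-difference gradient stands in for autograd. Both the toy network and the high-level model stay frozen; only the rotation angle is learned.

```python
import math
from itertools import product

PHI = 0.6  # hypothetical angle at which the toy "network" mixes u and v


def rot(theta, vec):
    """Rotate a 2-vector counter-clockwise by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def hidden(u, v):
    """Frozen low-level model, first half: encode (u, v) in a rotated basis."""
    return rot(PHI, (u, v))


def readout(h):
    """Frozen low-level model, second half: output u + 2*v from the hidden state."""
    u, v = rot(-PHI, h)
    return u + 2 * v


def intervene(theta, base, source):
    """Swap the first coordinate of the theta-rotated basis from source into base."""
    rb = rot(-theta, hidden(*base))
    rs = rot(-theta, hidden(*source))
    return readout(rot(theta, (rs[0], rb[1])))


def target(base, source):
    """Frozen high-level model: U takes the source value, V keeps the base value."""
    return source[0] + 2 * base[1]


INPUTS = list(product([0, 1], repeat=2))
PAIRS = list(product(INPUTS, INPUTS))


def loss(theta):
    """Mean squared gap between intervened low-level and high-level outputs."""
    return sum((intervene(theta, b, s) - target(b, s)) ** 2 for b, s in PAIRS) / len(PAIRS)


def iia(theta):
    """Share of base/source pairs on which both models agree after intervention."""
    hits = sum(math.isclose(intervene(theta, b, s), target(b, s), abs_tol=1e-6)
               for b, s in PAIRS)
    return hits / len(PAIRS)


theta, lr, eps = 0.0, 0.1, 1e-5
for _ in range(200):
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)  # central difference
    theta -= lr * grad
```

In this toy, the unrotated alignment (`theta = 0.0`) only matches the high-level model when base and source coincide, while the learned angle recovers the hidden basis and the interventions agree on every pair. A real replication would replace the angle with a full orthogonal matrix and the finite differences with automatic differentiation.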

Peer brief — for seminar discussion

Geiger et al. (2024) address a foundational bottleneck in causal abstraction-based interpretability: prior methods require brute-force search over localist alignments (mappings from high-level causal variables to disjoint neuron subsets), making them both computationally intractable and structurally biased against the distributed representations widely hypothesized to characterize neural networks. The paper introduces distributed alignment search (DAS), which parameterizes the alignment as an orthogonal rotation matrix over a subspace of a neural layer's representation, then optimizes it with stochastic gradient descent using interchange intervention training objectives, with both the neural network and the high-level causal model frozen. An alternative approach the paper could have used is iterative nullspace projection (INLP), which also searches for linear subspaces encoding target concepts but does so adversarially rather than causally and would not directly optimize interchange intervention accuracy.
The load-bearing empirical finding is a clean double dissociation. On a hierarchical equality task, a three-layer feed-forward network with hidden size 16 achieves 100% IIA under DAS at layer 1 with an 8-dimensional intervention subspace, while brute-force localist search plateaus at 0.60 IIA and the nearest localist re-projection at 0.73 IIA. On the Monotonicity NLI benchmark, BERT-base fine-tuned on MoNLI reaches 100% IIA at layer 9 with a 256-dimensional subspace encoding both negation and lexical entailment jointly, whereas no tested localist alignment exceeds 0.51 IIA. A further subspace decomposition (Subspace DAS) then reveals a structural difference between the two cases: the equality representations in the feed-forward network cannot be decomposed into individual entity-identity representations (decomposition IIA ≈ 0.50), whereas the apparent lexical-entailment representation in BERT decomposes nearly perfectly into two word-identity representations (IIA ≈ 0.97–0.98 at layer 9), indicating it is a data structure over lexeme identities rather than a genuine relational encoding.
The paper's interpretive claim is that when 100% IIA is achieved and representations resist decomposition, the neural network literally implements a symbolic, tree-structured algorithm, not merely an approximation of one. This is framed as foundational for understanding the coexistence of symbolic and connectionist computation. The paper also implicitly predicts that many previously reported weak or null causal abstraction findings in the literature will prove to be artifacts of the localist assumption rather than genuine evidence of non-symbolic computation.
A critical reader would push back on the scope of the experimental substrate. Both tasks (hierarchical equality on a toy three-layer MLP and MoNLI on a single BERT-base fine-tune) are constructed to have clean, known symbolic solutions with exactly two intermediate variables, and both models are trained to 100% accuracy before analysis begins. The claim that DAS scales to realistic large models and messy tasks remains undemonstrated: the paper itself acknowledges that rotating the full [CLS] representation of BERT-base across a concatenated token sequence would require approximately 15.4B parameters in the rotation matrix, which is intractable, and scaling is deferred to future work. A skeptic could reasonably argue that the 100% IIA results reflect the extreme simplicity and synthetic construction of the tasks rather than a general property of gradient-descent alignment search, and that the decomposability asymmetry, while striking, is observed in only two settings, one of which (BERT on MoNLI) is a fine-tuned model on a purpose-built dataset that may not generalize to naturalistic language understanding.
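
The quantities discussed in this brief can be written compactly. The notation below is assumed for illustration and is not taken from the paper: R is the learned orthogonal matrix, h_b and h_s are the hidden vectors for a base and a source input, P_k is the diagonal projection onto the first k rotated coordinates, L and H denote the frozen low-level and high-level models, D is the set of base/source pairs, and CE is cross-entropy.

```latex
\begin{align}
\mathrm{DII}_R(h_b, h_s) &= R^{\top}\bigl[(I - P_k)\,R\,h_b + P_k\,R\,h_s\bigr], \\
\mathrm{IIA}(R) &= \frac{1}{|\mathcal{D}|}\sum_{(b,s)\in\mathcal{D}}
  \mathbf{1}\bigl[\mathcal{L}_{\mathrm{DII}_R}(b,s) = \mathcal{H}_{V \leftarrow s}(b)\bigr], \\
R^{*} &= \arg\min_{R^{\top}R = I}\ \sum_{(b,s)\in\mathcal{D}}
  \mathrm{CE}\bigl(\mathcal{H}_{V \leftarrow s}(b),\ \mathcal{L}_{\mathrm{DII}_R}(b,s)\bigr).
\end{align}
```

Read this way, a localist alignment is the special case R = I (possibly up to a permutation of neurons), which is why DAS can only match or improve on the best localist IIA within the same layer and subspace size.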

Methods (4)

Distributed Alignment Search
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
Distributed Interchange Intervention
Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
Interchange Intervention Accuracy
Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
Subspace DAS
Extension of DAS that learns a second rotation matrix on top of a fixed first one to decompose representations into sub-representations.
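
The first three methods above can be made concrete with a small hand-built example. This is a sketch under stated assumptions, not the paper's network: the 4x4 matrix `R` is a scaled Hadamard matrix chosen because it is orthogonal and its own inverse, the toy "layer" stores the two equality bits along rotated axes by construction, and all function names are hypothetical. It contrasts a distributed interchange intervention (rotate, swap a rotated coordinate, rotate back) with a localist one (swap a single neuron in the standard basis) and scores both with interchange intervention accuracy.

```python
from itertools import product

# Hypothetical orthogonal matrix: a scaled, symmetric Hadamard matrix, so R @ R = I.
R = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5],
     [0.5, 0.5, -0.5, -0.5],
     [0.5, -0.5, -0.5, 0.5]]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def hidden(inp):
    """Toy low-level layer: stores [w==x] and [y==z] along rotated axes 0 and 1."""
    w, x, y, z = inp
    return matvec(R, [float(w == x), float(y == z), 0.0, 0.0])


def output(h):
    """Toy low-level readout: un-rotate, then compare the two equality bits."""
    r = matvec(R, h)
    return (r[0] > 0.5) == (r[1] > 0.5)


def high_level(inp, left=None):
    """High-level model; `left` optionally overrides the intermediate variable w==x."""
    w, x, y, z = inp
    left = (w == x) if left is None else left
    return left == (y == z)


def distributed_intervention(base, source, dims=(0,)):
    rb, rs = matvec(R, hidden(base)), matvec(R, hidden(source))  # rotate
    mixed = [rs[i] if i in dims else rb[i] for i in range(4)]     # swap subspace
    return output(matvec(R, mixed))                                # rotate back


def localist_intervention(base, source, neurons=(0,)):
    hb, hs = hidden(base), hidden(source)
    return output([hs[i] if i in neurons else hb[i] for i in range(4)])


def iia(intervention):
    """Fraction of base/source pairs where low-level and high-level effects match."""
    inputs = list(product([0, 1], repeat=4))
    hits = total = 0
    for base, source in product(inputs, inputs):
        expected = high_level(base, left=(source[0] == source[1]))
        hits += intervention(base, source) == expected
        total += 1
    return hits / total


distributed_iia = iia(distributed_intervention)
localist_iia = iia(localist_intervention)
```

Because each neuron in this toy layer carries a blend of both equality bits, swapping one neuron corrupts both variables at once, whereas swapping one rotated coordinate changes exactly one of them. That is the intuition behind allowing individual neurons to play multiple roles.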

Frameworks (1)

Parallel Distributed Processing Framework
The theoretical framework from Rumelhart, McClelland, and Smolensky (1986) identifying distributed representations in neural networks; theoretical precursor to DAS.

Datasets (3)

Hierarchical Equality Training Data
1.92M randomly generated input-output pairs used to train the feed-forward network on the hierarchical equality task.
MoNLI Dataset
Natural language inference dataset where premise-hypothesis pairs differ by a single word; used to evaluate DAS on BERT.
MultiNLI Dataset
BERT is first fine-tuned on MultiNLI before being fine-tuned on MoNLI in the NLI experiment.
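
For the hierarchical equality data listed above, a generator might look like the following minimal sketch. It assumes each of the four objects w, x, y, z is a 4-dimensional vector (consistent with the 16-dimensional input mentioned earlier) and that the label is true when the two pairs agree on equality. The sampling range, the 50% rate of equal pairs, and the name `make_example` are assumptions introduced here, not details taken from the source.

```python
import random


def make_example(rng, dim=4):
    """Return (16-dim input tuple, label) for one hierarchical equality example."""
    def vec():
        return tuple(rng.uniform(-0.5, 0.5) for _ in range(dim))

    def pair():
        first = vec()
        # Assumed: each pair is an exact copy half of the time, which balances labels.
        second = first if rng.random() < 0.5 else vec()
        return first, second

    w, x = pair()
    y, z = pair()
    label = (w == x) == (y == z)  # true when both pairs are equal or both unequal
    return w + x + y + z, label


rng = random.Random(0)  # fixed seed for reproducibility
data = [make_example(rng) for _ in range(1000)]
```

Scaling the final list comprehension up to the reported dataset size only changes the range argument; the intermediate values `w == x` and `y == z` are the two high-level variables that the alignment search later targets.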

Findings (12)

Lexical entailment representation decomposes into word identity sub-representations with ~0.97-0.98 IIA (Lexeme Subspace of Lexical Entailment)
In contrast to hierarchical equality, lexical entailment in BERT decomposes into representations of word identities, not a single abstract relation.
Identity Subspace of Left Equality model achieves ~0.50 IIA, indicating equality relations cannot be decomposed into input identities
DAS reveals that the network encodes abstract equality relations rather than storing identities of inputs.
Learned rotation matrices are non-trivial: majority of basis vectors are rotated, indicating highly distributed representations
Learned rotations reveal that direct probes over standard activation bases would miss the actual causal role of representations.
DAS on oversized randomly initialized network (|N|=4096 for 16-dim input) achieves 0.64 IIA by searching random structure
Shows that overly large hidden dimensions allow DAS to find random causal structures; calibration check.
DAS achieves 100% IIA for combined Negation and Lexical Entailment model on MoNLI at Layer 9, intervention size 256
Perfect abstraction relation between BERT and symbolic algorithm with negation and lexical entailment variables.
DAS on randomly initialized small networks (|N|=16) achieves only 0.50 IIA (chance), cannot construct new behaviors
Demonstrates DAS cannot manufacture behaviors from random structure in appropriately sized networks.
DAS runs in 502 seconds for hierarchical equality vs. estimated 6e8 seconds for exhaustive brute-force search
DAS runtime does not grow with the number of alignment hypotheses tested, unlike brute-force search.
Best localist alignment achieves an IIA of 0.73 for the Both Equality Relations model at Layer 1 on hierarchical equality
Shows localist alignment fails to capture the distributed structure found by DAS.
Brute-force search achieves a best IIA of 0.60 for the Both Equality Relations model at Layer 1 on hierarchical equality
DAS substantially outperforms brute-force search (1.00 vs 0.60 IIA) on the hierarchical equality task.
DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1
DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
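
As a quick sanity check on the runtime figures listed above for hierarchical equality, the arithmetic can be spelled out. The two input numbers come from the findings; the variable names are illustrative.

```python
das_seconds = 502            # DAS runtime reported above
brute_force_seconds = 6e8    # estimated exhaustive brute-force runtime reported above

speedup = brute_force_seconds / das_seconds
brute_force_years = brute_force_seconds / (365 * 24 * 3600)

print(f"speedup ~ {speedup:.2e}x, brute force ~ {brute_force_years:.0f} years")
```

The gap is roughly six orders of magnitude: the estimated exhaustive search would run for about 19 years, against under ten minutes for DAS.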

Claims (9)

The discovery of perfect abstract equality representations that cannot be decomposed into entity representations is a foundational result informing our understanding of how symbolic and connectionist architectures coexist
Concluding claim about theoretical significance of the hierarchical equality finding.
Causal abstraction theory is a unified framework that subsumes diverse intervention-based interpretability methods including LIME, causal mediation analysis, INLP, and circuit explanations
The paper endorses Geiger et al. 2023's claim that disparate interpretability methods are instances of causal abstraction.
The feed-forward network truly implements a symbolic, tree-structured algorithm for hierarchical equality, with abstract equality relations not decomposable into input identities
DAS reveals that the neural network encodes abstract relational structure rather than raw input identities.
What appears to be a representation of lexical entailment in BERT is actually a data structure of two word identity representations, not an encoding of the entailment relation
Key asymmetry between hierarchical equality and NLI experiments; BERT stores identities rather than the abstract relation.
Investigating the causal substructure of neural representations is necessary to avoid misidentifying data structures of simpler representations as abstract concepts
Motivated by the finding that lexical entailment decomposes into word identities.
There is a many-to-many mapping between neurons and concepts, meaning multiple high-level causal variables might be encoded in overlapping groups of neurons
Fundamental theoretical claim motivating DAS, attributed to Smolensky/Rumelhart/McClelland.
Direct probes over learned activations in standard basis may fail to reveal the actual causal role of representations because they are highly distributed
Supported by the finding that non-trivial rotations are required to find aligned representations.
DAS overcomes the localist limitation of prior causal abstraction by allowing individual neurons to play multiple roles via non-standard bases
Central claim motivating DAS over prior methods.
DAS finds better alignments than brute-force search by using gradient descent rather than exhaustive discrete search
Second central claim of the paper.

Hypotheses (1)

Larger hidden representations create more random structure that DAS can search through, allowing manipulation of counterfactual behavior even in randomly initialized networks
Tested in Section 4.4 calibration experiment; confirmed by findings.

Questions (4)

Can the distributed representation of lexical entailment be decomposed into representations of the individual word identities?
Research question leading to the key NLI finding about word identity data structures.
Does the hierarchical equality network implement a program that computes w=x and y=z as intermediate values?
Specific research question for the first experiment.
Can an interpretable symbolic algorithm be used to faithfully explain a complex neural network model?
Framing question for the paper's research program.
Does DAS scale with large foundation models?
Practical scalability question addressed in Appendix D.

Original abstract

Causal abstraction is a promising theoretical framework for explainable artificial intelligence that defines when an interpretable high-level causal model is a faithful simplification of a low-level deep learning system. However, existing causal abstraction methods have two major limitations: they require a brute-force search over alignments between the high-level model and the low-level one, and they presuppose that variables in the high-level model will align with disjoint sets of neurons in the low-level one. In this paper, we present distributed alignment search (DAS), which overcomes these limitations. In DAS, we find the alignment between high-level and low-level models using gradient descent rather than conducting a brute-force search, and we allow individual neurons to play multiple distinct roles by analyzing representations in non-standard bases, i.e., distributed representations. Our experiments show that DAS can discover internal structure that prior approaches miss. Overall, DAS removes previous obstacles to conducting causal abstraction analyses and allows us to find conceptual structure in trained neural nets.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 91%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 87%
Model Alignment Search
in corpus
2025
≈ 86%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 86%
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
Jonathn Chang, Arya Datla, Ziv Goldfeld
2026
≈ 85%
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Usman Naseem
2026
≈ 84%
Beyond Object-Level Alignment: Do Brains and DNNs Preserve the Same Transformations?
Yukiyasu Kamitani
2026
≈ 84%
Combining Causal Models for More Accurate Abstractions of Neural Networks
Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger
2025
≈ 84%
Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions
Manan Gupta, Dhruv Kumar
2025
≈ 84%
Constructing Interpretable Features from Compositional Neuron Groups
Or Shafran, Atticus Geiger, Mor Geva
2026
≈ 83%
Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis
Mitchell Ostrow, Adam Eisen, Leo Kozachkov, Ila Fiete
2023
≈ 83%
Causal Probing for Internal Visual Representations in Multimodal Large Language Models
Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang
2026
≈ 83%
Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
Francesco Sovrano, Gabriele Dominici, Marc Langheinrich
2026
≈ 83%
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
Yiming Tang, Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu
2026
≈ 83%
Dynamical similarity analysis can identify compositional dynamics developing in RNNs
Quentin Guilhot, Michał Wójcik, Jascha Achterberg, Rui Ponte Costa
2024
≈ 82%
Identifying Sub-networks in Neural Networks via Functionally Similar Representations
Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Dennis Wei
2025
≈ 82%
Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
Ge Yan, Tuomas Oikarinen, Tsui-Wei (Lily) Weng
2025
≈ 82%
Probing the Probes: Methods and Metrics for Concept Alignment
Jacob Lysnæs-Larsen, Marte Eggen, Inga Strümke
2025
≈ 82%
The Platonic Representation Hypothesis
in corpus
2024
≈ 82%
Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion
Eghbal A. Hosseini, Brian Cheung, Evelina Fedorenko, Alex H. Williams
2026
≈ 82%
The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents
in corpus
2026
≈ 81%
Causal analysis of syntactic agreement mechanisms in neural language models
cited
2021
≈ 80%
The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring
in corpus
2025
≈ 80%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 80%
Generalizing frameworks for sentience beyond natural species
in corpus
≈ 79%
Alignment faking in large language models
in corpus
2024
≈ 79%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 79%
Neural natural language inference models partially embed theories of lexical entailment and negation
cited
2020
≈ 78%
Zoom In: An Introduction to Circuits
cited
2020
≈ 76%
In-context Learning and Induction Heads
cited
2022
≈ 73%

Cited by (6)

Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat
Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as