The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

By Denis Sutter · Julian Minder · Thomas Hofmann · Tiago Pimentel

DOI 10.48550/arxiv.2507.08802 · arXiv 2507.08802 · OpenAlex W6947949709

Causal abstraction · Distributed Alignment Search (DAS) · Identity Alignment Map (ϕ_id) · Distributive Law Task Dataset · Interpretability Illusion · Linear Representation Hypothesis · Linear Alignment Map (ϕ_lin) · Hierarchical Equality Task Dataset · Neural Collapse · Non-Linear Alignment Map (ϕ_nonlin) · IOI Dataset (Muhia 2022) · Non-Linear Representation Dilemma · Pythia Suite Models · Non-Linear Representation Hypothesis

TL;DR

Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assumptions (countable input-space, layer-wise input-injectivity, strict output-surjectivity, matchable partial-orderings, and task-solving). This triviality is demonstrated empirically using distributed alignment search (DAS) with reversible residual network (RevNet) alignment maps (ϕ_nonlin) on Pythia suite models ranging from 31M to 410M parameters: near-perfect interchange intervention accuracy (IIA) is achieved even for randomly initialised models on the indirect object identification (IOI) task, and over 80% IIA is reached on randomly initialised 3-layer MLPs in the hierarchical equality task. By contrast, linear alignment maps (ϕ_lin) track the model's actual learning trajectory, exhibiting layer-dependent degradation patterns—such as IIA collapse in layer 3 of the hierarchical equality MLP—that vanish entirely under ϕ_nonlin. This empirical asymmetry is the crux of what the paper terms the non-linear representation dilemma: lifting the linearity constraint that implicitly underlies DAS and related methods eliminates the principled basis for distinguishing genuine algorithmic implementation from spurious alignment, implying that causal abstraction is not sufficient for mechanistic interpretability and must be coupled with explicit, justified assumptions about how features are encoded in neural network representations.

What to take away

1. Theorem 1 proves that under five mild assumptions—including layer-wise input-injectivity and strict output-surjectivity—any algorithm A is an input-restricted distributed abstraction of any DNN N, making unrestricted causal abstraction vacuous.
2. Using RevNet-based non-linear alignment maps (ϕ_nonlin with L_rn=8, d_rn=64) applied to Pythia-410m, near-perfect IIA is achieved on the indirect object identification (IOI) task at every training step, including at random initialisation before any task learning occurs.
3. With linear alignment maps (ϕ_lin) on a 3-layer MLP in the hierarchical equality task, IIA for the both-equality-relations algorithm drops substantially at layer 3, a pattern that disappears completely when ϕ_nonlin (L_rn=10, d_rn=16) is used instead.
4. Randomly initialised 3-layer MLPs with hidden size 16 achieve over 80% IIA on the hierarchical equality task when the most complex ϕ_nonlin alignment map is used, despite these models being incapable of solving the task.
5. For Pythia models of all sizes tested (31M, 70M, 160M, 410M parameters), near-perfect IIA on the IOI task can be found with ϕ_nonlin even at random initialisation, including for the 31M and 70M models that never successfully learn the task after full training.
6. As Pythia-410m training progresses, progressively simpler alignment maps suffice to achieve perfect IIA on the IOI task: by full training, even a 1-layer ϕ_nonlin achieves perfect alignment, while at initialisation an 8-layer map is required.
7. When training and test sets for the IOI task use completely disjoint sets of names, ϕ_nonlin fails to generalise for randomly initialised Pythia-410m, achieving near-zero IIA, suggesting that spurious alignment on untrained models is name-memorisation rather than genuine structural understanding.
8. The paper introduces the non-linear representation dilemma: without a principled bound on alignment map complexity, causal abstraction analyses face an unsolvable accuracy-complexity tradeoff analogous to the previously unresolved probe complexity-accuracy debate in diagnostic probing.
9. Transformer layers (embedding, MLP+residual, attention+residual) are proven almost surely injective at initialisation when weights are drawn from any continuous distribution (Theorem 2), providing theoretical grounding for the input-injectivity assumption used in Theorem 1.
10. An open question the paper raises is whether a formal criterion—analogous to minimum description length probing or Pareto-optimal probing—can be defined to adjudicate the accuracy-complexity tradeoff for alignment maps ϕ in causal abstraction analyses.

Peer brief — for seminar discussion

Sutter et al. (NeurIPS 2025) ask whether causal abstraction, as formalised by Geiger et al. (2024b) and operationalised through distributed alignment search (DAS), is sufficient for mechanistic interpretability when no constraint is imposed on the alignment map ϕ that mediates between a DNN's hidden states and an algorithm's nodes. The paper proceeds in two registers: a formal proof and an empirical corroboration. Theorem 1 establishes that, under five assumptions—countable input-space, layer-wise input-injectivity, strict output-surjectivity, matchable partial-orderings between algorithm and DNN, and task-solving by the DNN—any algorithm A is an input-restricted distributed abstraction of any network N. The proof is constructive but existence-only: it exploits the uncountability of the real-valued hidden state space against the countability of input-restricted interventions to build alignment maps that can encode arbitrary target outputs, representing a form of extreme overfitting with no generalization guarantee. The empirical work validates that such maps are practically learnable. Using RevNet-based non-linear alignment maps (ϕ_nonlin, varied from L_rn=1 to L_rn=8 with d_rn up to 64) applied via DAS to the Pythia suite (31M–410M parameters) on the indirect object identification task, near-perfect interchange intervention accuracy (IIA) is achieved at random initialisation across all model sizes—including 31M and 70M models that never learn the task after training. On the hierarchical equality task using a 3-layer MLP with hidden size 16, randomly initialised models exceed 80% IIA with the most complex ϕ_nonlin, and the layer-dependent IIA degradation patterns observed with linear maps (ϕ_lin) vanish entirely under non-linear maps. The paper terms this the non-linear representation dilemma: relaxing the linearity constraint that implicitly underlies DAS destroys the signal that made DAS informative. 
The paper argues this implies causal abstraction cannot stand alone as a mechanistic interpretability method: it is only meaningful when paired with explicit, justified assumptions about representation encoding (the privileged-bases, linear, or non-linear representation hypotheses). The paper introduces ϕ_nonlin (RevNet alignment maps) as its primary diagnostic instrument; an alternative it could have employed is HyperDAS (Sun et al., 2025), which automates node-subspace search via hypernetworks and might expose similar triviality from a different angle. A critical reader would push back on the injectivity assumption: the existence proof in Theorem 1 relies on input-injectivity across all layers, which the authors themselves acknowledge is violated in practice by phenomena like neural collapse and the softmax bottleneck, and which they verify empirically only via collision-counting on 1,280,000 samples for a single small MLP. The gap between 'almost surely injective at initialisation' (Theorem 2) and 'injective after training' is substantial, since neural collapse is precisely a training-induced failure of injectivity, so the theorem's assumptions may not hold for the very trained models on which the empirical results are most consequential. The paper also hypothesises that the way linear alignment maps track the DNN's actual learning trajectory is evidence for the linear representation hypothesis, but explicitly acknowledges it cannot formalise this intuition, leaving the diagnostic value of ϕ_lin vs. ϕ_nonlin comparisons as an open methodological question.

Methods (3)

Identity Alignment Map (ϕ_id)
Simplest alignment map ϕ(h)=h, equivalent to assuming privileged bases hypothesis
Linear Alignment Map (ϕ_lin)
Alignment map ϕ(h)=W_orth*h using orthogonal matrix; assumes linear representation hypothesis
Non-Linear Alignment Map (ϕ_nonlin)
Alignment map implemented as a reversible residual network (RevNet); assumes non-linear representation hypothesis
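
The three alignment maps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the hidden size, the additive-coupling form of the reversible block, the random tanh sub-networks, and the names `CouplingBlock`, `make_mlp`, and `d_hidden` are all assumptions made for the example, and `d_hidden` is not claimed to correspond to the paper's d_rn. The point it shows is that each map is invertible, which is what allows an intervention made in the mapped space to be carried back to the hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (assumption); must be even for the coupling blocks


def phi_id(h):
    """Identity map: variables are read straight off the neuron basis."""
    return h


# Orthogonal matrix from the QR decomposition of a random Gaussian matrix.
W_orth, _ = np.linalg.qr(rng.standard_normal((d, d)))


def phi_lin(h):
    return W_orth @ h


def phi_lin_inv(z):
    return W_orth.T @ z  # the inverse of an orthogonal matrix is its transpose


def make_mlp(d_io, d_hidden):
    """Small random tanh network used inside a coupling block (hypothetical)."""
    W1 = rng.standard_normal((d_hidden, d_io)) / np.sqrt(d_io)
    W2 = rng.standard_normal((d_io, d_hidden)) / np.sqrt(d_hidden)
    return lambda x: W2 @ np.tanh(W1 @ x)


class CouplingBlock:
    """Reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""

    def __init__(self, d, d_hidden):
        self.half = d // 2
        self.F = make_mlp(self.half, d_hidden)
        self.G = make_mlp(self.half, d_hidden)

    def forward(self, h):
        x1, x2 = h[:self.half], h[self.half:]
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return np.concatenate([y1, y2])

    def inverse(self, z):
        y1, y2 = z[:self.half], z[self.half:]
        x2 = y2 - self.G(y1)  # undo the second update first
        x1 = y1 - self.F(x2)
        return np.concatenate([x1, x2])


blocks = [CouplingBlock(d, d_hidden=16) for _ in range(3)]


def phi_nonlin(h):
    for block in blocks:
        h = block.forward(h)
    return h


def phi_nonlin_inv(z):
    for block in reversed(blocks):
        z = block.inverse(z)
    return z


h = rng.standard_normal(d)
```

Stacking more blocks or widening the sub-networks makes ϕ_nonlin more expressive while keeping it exactly invertible, which is the knob the accuracy-complexity trade-off discussed in this summary turns on.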

Frameworks (2)

Distributed Alignment Search (DAS)
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
Linear Representation Hypothesis
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
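
The interchange intervention that DAS optimises, and the IIA metric computed from it, can be sketched on a toy problem. Everything here is an assumption made for illustration: the two-dimensional "network" (`encode`, `readout`), the one-variable `algorithm`, and the fixed rotation `R` are hypothetical, and the alignment map is given by hand rather than learned by gradient descent as in DAS. The sketch shows why the choice of alignment map matters: with the map that matches the encoding, swapping the aligned coordinate reproduces the algorithm's counterfactual output, while the identity map does not.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])


def encode(latents):
    """Toy lower layers: store two latent variables in a rotated basis."""
    return R @ latents


def readout(h):
    """Toy upper layers: the output depends only on the first latent."""
    return int((R.T @ h)[0] > 0)


def algorithm(v):
    """High-level algorithm with a single variable V."""
    return int(v > 0)


def interchange(h_base, h_source, phi, phi_inv, k=1):
    """Overwrite the first k aligned coordinates of base with those of source."""
    z_base = phi(h_base).copy()
    z_base[:k] = phi(h_source)[:k]
    return phi_inv(z_base)


def iia(phi, phi_inv, n_pairs=1000):
    """Share of (base, source) pairs where network and algorithm agree."""
    hits = 0
    for _ in range(n_pairs):
        base, source = rng.standard_normal(2), rng.standard_normal(2)
        h_new = interchange(encode(base), encode(source), phi, phi_inv)
        # Setting V to the source's value makes the algorithm output algorithm(source[0]).
        hits += readout(h_new) == algorithm(source[0])
    return hits / n_pairs


iia_aligned = iia(lambda h: R.T @ h, lambda z: R @ z)
iia_identity = iia(lambda h: h, lambda z: z)
print(iia_aligned, iia_identity)
```

In this toy setting the rotation-aligned map reaches an IIA of 1.0 and the identity map falls well short of it, which mirrors the role that ϕ_lin plays relative to ϕ_id in the summary above.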

Datasets (4)

Distributive Law Task Dataset
Synthetic task dataset for (x1==x2)∧(x3==x4) ∨ (x3==x4)∧(x5==x6); used in appendix experiments
Hierarchical Equality Task Dataset
Synthetic task dataset: classify (x1==x2)==(x3==x4) for 16-dim inputs; used for MLP experiments
IOI Dataset (Muhia 2022)
Dataset for indirect object identification task used in language model experiments
Pythia Suite Models
Language model suite used for IOI experiments across sizes and training checkpoints
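
The two synthetic labelling rules listed above are simple enough to write down directly. The following is a hedged sketch rather than the paper's data pipeline: the uniform sampling range, the balanced sampling of equal and unequal pairs, the reading of "16-dim inputs" as four objects of four dimensions each, and the function names are all assumptions for illustration.

```python
import random


def sample_pair(rng, equal, dim):
    """Two objects that are identical if `equal`, otherwise independently drawn."""
    a = tuple(rng.uniform(-0.5, 0.5) for _ in range(dim))
    b = a if equal else tuple(rng.uniform(-0.5, 0.5) for _ in range(dim))
    return a, b


def hierarchical_equality_label(x1, x2, x3, x4):
    """1 if both pairs are equal or both are unequal, else 0."""
    return int((x1 == x2) == (x3 == x4))


def distributive_law_label(x1, x2, x3, x4, x5, x6):
    """(x1==x2) AND (x3==x4), OR (x3==x4) AND (x5==x6)."""
    return int((x1 == x2 and x3 == x4) or (x3 == x4 and x5 == x6))


def sample_hierarchical_equality(rng, dim=4):
    """One example: flattened input vector and its label."""
    x1, x2 = sample_pair(rng, rng.random() < 0.5, dim)
    x3, x4 = sample_pair(rng, rng.random() < 0.5, dim)
    objects = (x1, x2, x3, x4)
    flat = [value for obj in objects for value in obj]
    return flat, hierarchical_equality_label(*objects)


rng = random.Random(0)
features, label = sample_hierarchical_equality(rng)
```

Note that the distributive-law expression (A ∧ B) ∨ (B ∧ C) is logically equivalent to B ∧ (A ∨ C), so the same input-output behaviour admits more than one intermediate-variable structure, which is what makes it a natural test bed for comparing candidate algorithms.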

Findings (15)

Minimal Euclidean distances between hidden states are smaller for pairs sharing same output or equality-variable values than for pairs that do not, across 1,280,000 MLP samples
Explains why RevNet lacks capacity to separate states for identity-of-first-argument algorithm
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
Linear alignment map ϕ_lin shows substantial IIA decrease in third layer for both equality relations and left equality relation algorithms in hierarchical equality task
Replicates Geiger et al. 2024b pattern of layer-dependent IIA degradation with linear maps
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
Across 5 Pythia seeds, one seed fails to learn IOI task and another fails alignment despite learning the task; all other seeds achieve perfect alignment with ϕ_nonlin
Robustness check across seeds showing occasional failures of alignment map training
Theorem 2: Transformers with randomly independently initialized continuous distribution weights are almost surely injective at initialisation up to each layer
Supports input-injectivity assumption for transformers at initialisation
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences
Corroborating result on additional task confirming main paper findings
With only 1,000 training samples, ϕ_nonlin achieves IIA over 0.99 on training set for identity of first argument algorithm, but fails at scale
Confirms theorem's existence proof holds but practical learnability fails with insufficient RevNet capacity
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps
Attributed to model anisotropy from saturation making hidden states harder to access
No collisions found in 1,280,000 randomly sampled inputs through trained MLP in hierarchical equality task across 10 random seeds
Empirical support for input-injectivity assumption holding in practice
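
A collision count of the kind described in the findings above, together with a minimal-distance check, might look like the following. This is only a sketch under stated assumptions: it uses a single randomly initialised tanh layer rather than the paper's trained MLP, 10,000 samples rather than 1,280,000, and a uniform input range chosen for illustration; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_samples = 16, 16, 10_000  # toy sizes (assumption)

W = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
b = rng.standard_normal(d_hidden)


def hidden_states(X):
    """One tanh layer applied row-wise to a batch of inputs."""
    return np.tanh(X @ W.T + b)


def count_collisions(H):
    """Number of rows that exactly duplicate another row."""
    return H.shape[0] - np.unique(H, axis=0).shape[0]


def min_pairwise_distance(H):
    """Smallest Euclidean distance between two distinct rows."""
    diffs = H[:, None, :] - H[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each row's distance to itself
    return dists.min()


X = rng.uniform(-0.5, 0.5, size=(n_samples, d_in))
H = hidden_states(X)
n_collisions = count_collisions(H)
min_dist = min_pairwise_distance(H[:500])  # subsample keeps memory small
```

Exact-duplicate counting only rules out collisions among the sampled inputs; it is evidence for injectivity on that sample, not a proof of injectivity in general, which is the gap the peer brief points to.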

Claims (7)

Early causal abstraction methods (Geiger et al. 2021) implicitly rely on the privileged bases hypothesis, while recent methods (Geiger et al. 2024b) rely on the linear representation hypothesis
Historical framing of how representation assumptions have evolved in causal interpretability
Assuming linear representations enables identifying the location of certain variables in a DNN, but many insights fail to generalise when more powerful non-linear maps are used
Interpretive claim about what linear DAS results actually tell us
Near-perfect IIA can be achieved on randomly initialised models that cannot solve the task, suggesting causal alignment does not require task capability
Empirical support for vacuousness of unrestricted causal abstraction
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information
Central thesis of the paper
Generalisation of alignment maps to unseen inputs is fundamental to interpreting a model, distinguishing genuine understanding from memorisation
Authors' proposed criterion for meaningful causal abstraction
Causal abstraction implicitly relies on strong assumptions about feature encoding in DNNs, and becomes trivial without such assumptions
Authors' interpretation connecting their proof to practical interpretability methodology
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off
Authors connect their finding to the prior probing literature debate

Hypotheses (3)

The And-Or algorithm may not be a true abstraction of the trained MLP's behaviour since it never achieves high IIA in later layers regardless of alignment map complexity
Hypothesis raised in distributive law task analysis
The fact that ϕ_lin tracks DNN performance more closely than ϕ_nonlin throughout training may support the linear representation hypothesis for IOI task features
Authors' tentative hypothesis from Fig. 4 but they acknowledge they cannot formalise this intuition
Transformers almost surely maintain input-injectivity throughout training, not just at initialisation
Conjecture supported by Nikolaou et al. 2025 for last-token hidden states

Questions (4)

What can causal abstraction analyses tell us about how DNNs encode features if the methods themselves rely on encoding assumptions?
Circular dependency problem raised in discussion
What factors determine the generalisation of learned alignment maps beyond training data?
Open question about the gap between Theorem 1's existence proof and practical learnability
What is the connection between information encoding assumptions and causal abstraction?
Identified as exciting future work direction
What should you do if you want to perform a causal analysis of your DNN?
Practical question the paper attempts to answer in its conclusion

Original abstract

The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed to alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.

Related work— refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
cited
in corpus
2023
≈ 91%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
cited
in corpus
2024
≈ 85%
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Usman Naseem
2026
≈ 88%
Combining Causal Models for More Accurate Abstractions of Neural Networks
Sara Magliacane, Atticus Geiger, Theodora-Mara Pîslar
2025
≈ 87%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 87%
Model Alignment Search
in corpus
2025
≈ 86%
Patterning: The Dual of Interpretability
Daniel Murfet, George Wang
2026
≈ 86%
Beyond Object-Level Alignment: Do Brains and DNNs Preserve the Same Transformations?
Yukiyasu Kamitani
2026
≈ 85%
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
Arya Datla, Ziv Goldfeld, Jonathn Chang
2026
≈ 85%
Patches of Nonlinearity: Instruction Vectors in Large Language Models
Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych, Irina Bigoulaeva
2026
≈ 85%
The Platonic Representation Hypothesis
in corpus
2024
≈ 85%
Tracking Equivalent Mechanistic Interpretations Across Neural Networks
Mariya Toneva, Alan Sun
2026
≈ 85%
Atlas-Alignment: Making Interpretability Transferable Across Language Models
Jim Berend, Sebastian Lapuschkin, Wojciech Samek, Bruno Puri
2026
≈ 84%
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn, Lucius Bushnaq
2024
≈ 84%
Constructing Interpretable Features from Compositional Neuron Groups
Atticus Geiger, Mor Geva, Or Shafran
2026
≈ 84%
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu, Yiming Tang
2026
≈ 84%
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints
Yousung Lee, Dongsoo Har, Andres Saurez
2026
≈ 84%
Causal Probing for Internal Visual Representations in Multimodal Large Language Models
Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang, Zehao Deng
2026
≈ 84%
Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods
Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua, Yotam Wolf
2025
≈ 83%
Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
Gabriele Dominici, Marc Langheinrich, Francesco Sovrano
2026
≈ 83%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 82%
Alignment faking in large language models
in corpus
2024
≈ 82%
The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents
in corpus
2026
≈ 81%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 81%
The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring
in corpus
2025
≈ 81%
Linear algebraic structure of word senses, with applications to polysemy
cited
2018
≈ 78%
A Mathematical Framework for Transformer Circuits
cited
2021
≈ 77%
Not all language model features are one-dimensionally linear
cited
2024
≈ 76%
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
cited
2024
≈ 72%
Recurrent neural networks learn to store and generate sequences using non-linear representations
cited
2024
≈ 72%

Cited by (2)

Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie