paper
active
2025
paper:doi-10-48550-arxiv-2507-08802

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

TL;DR

Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm. Theorem 1 proves this formally under five mild assumptions (countable input-space, layer-wise input-injectivity, strict output-surjectivity, matchable partial-orderings, and task-solving). The paper demonstrates the triviality empirically using distributed alignment search (DAS) with reversible residual network (RevNet) alignment maps (ϕ_nonlin) on Pythia suite models ranging from 31M to 410M parameters: near-perfect interchange intervention accuracy (IIA) is achieved even for randomly initialised models on the indirect object identification (IOI) task, and over 80% IIA is reached on randomly initialised 3-layer MLPs in the hierarchical equality task. By contrast, linear alignment maps (ϕ_lin) track the model's actual learning trajectory and exhibit layer-dependent degradation patterns, such as IIA collapse in layer 3 of the hierarchical equality MLP, that vanish entirely under ϕ_nonlin. This empirical asymmetry is the crux of what the paper terms the non-linear representation dilemma: lifting the linearity constraint that implicitly underlies DAS and related methods eliminates the principled basis for distinguishing genuine algorithmic implementation from spurious alignment. The implication is that causal abstraction is not sufficient for mechanistic interpretability and must be coupled with explicit, justified assumptions about how features are encoded in neural network representations.

What to take away

  1. Theorem 1 proves that under five mild assumptions (including layer-wise input-injectivity and strict output-surjectivity), any algorithm A is an input-restricted distributed abstraction of any DNN N, making unrestricted causal abstraction vacuous.
  2. Using RevNet-based non-linear alignment maps (ϕ_nonlin with L_rn=8, d_rn=64) applied to Pythia-410m, near-perfect IIA is achieved on the indirect object identification (IOI) task at every training step, including at random initialisation before any task learning occurs.
  3. With linear alignment maps (ϕ_lin) on a 3-layer MLP in the hierarchical equality task, IIA for the both-equality-relations algorithm drops substantially at layer 3, a pattern that disappears completely when ϕ_nonlin (L_rn=10, d_rn=16) is used instead.
  4. Randomly initialised 3-layer MLPs with hidden size 16 achieve over 80% IIA on the hierarchical equality task when the most complex ϕ_nonlin alignment map is used, despite these models being incapable of solving the task.
  5. For Pythia models of all sizes tested (31M, 70M, 160M, 410M parameters), near-perfect IIA on the IOI task can be found with ϕ_nonlin even at random initialisation, including for the 31M and 70M models that never successfully learn the task after full training.
  6. As Pythia-410m training progresses, progressively simpler alignment maps suffice to achieve perfect IIA on the IOI task: by full training, even a 1-layer ϕ_nonlin achieves perfect alignment, while at initialisation an 8-layer map is required.
  7. When training and test sets for the IOI task use completely disjoint sets of names, ϕ_nonlin fails to generalise for randomly initialised Pythia-410m, achieving near-zero IIA, suggesting that spurious alignment on untrained models is name-memorisation rather than genuine structural understanding.
  8. The paper introduces the non-linear representation dilemma: without a principled bound on alignment map complexity, causal abstraction analyses face an unsolvable accuracy-complexity tradeoff analogous to the previously unresolved probe complexity-accuracy debate in diagnostic probing.
  9. Transformer layers (embedding, MLP+residual, attention+residual) are proven almost surely injective at initialisation when weights are drawn from any continuous distribution (Theorem 2), providing theoretical grounding for the input-injectivity assumption used in Theorem 1.
  10. An open question the paper raises is whether a formal criterion, analogous to minimum description length probing or Pareto-optimal probing, can be defined to adjudicate the accuracy-complexity tradeoff for alignment maps ϕ in causal abstraction analyses.
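
To make the takeaways above concrete, here is a minimal sketch of an invertible, RevNet-style alignment map and of a single interchange intervention through it. It is an illustration under stated assumptions, not the paper's implementation: the names `CouplingBlock`, `AlignmentMap`, `interchange`, and `iia` are hypothetical, the blocks use fixed random weights instead of being trained with DAS, and "the first k coordinates of the mapped space" stands in for the subspace assigned to one algorithm node.

```python
import math
import random


def make_layer(d_in, d_out, rng):
    """Fixed random tanh layer R^d_in -> R^d_out (untrained, illustration only)."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

    def f(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

    return f


class CouplingBlock:
    """Additive coupling block in the RevNet style; invertible for any F and G."""

    def __init__(self, d, rng):
        assert d % 2 == 0, "hidden size must be even to split into two halves"
        self.h = d // 2
        self.F = make_layer(self.h, self.h, rng)
        self.G = make_layer(self.h, self.h, rng)

    def forward(self, x):
        x1, x2 = x[:self.h], x[self.h:]
        y1 = [a + b for a, b in zip(x1, self.F(x2))]
        y2 = [a + b for a, b in zip(x2, self.G(y1))]
        return y1 + y2

    def inverse(self, y):
        y1, y2 = y[:self.h], y[self.h:]
        x2 = [a - b for a, b in zip(y2, self.G(y1))]
        x1 = [a - b for a, b in zip(y1, self.F(x2))]
        return x1 + x2


class AlignmentMap:
    """Stack of coupling blocks: a non-linear but exactly invertible map."""

    def __init__(self, d, n_blocks, seed=0):
        rng = random.Random(seed)
        self.blocks = [CouplingBlock(d, rng) for _ in range(n_blocks)]

    def forward(self, x):
        for block in self.blocks:
            x = block.forward(x)
        return x

    def inverse(self, y):
        for block in reversed(self.blocks):
            y = block.inverse(y)
        return y


def interchange(phi, h_base, h_source, k):
    """Swap the first k mapped coordinates of the base state with the source's."""
    z_base = phi.forward(h_base)
    z_source = phi.forward(h_source)
    z_mixed = z_source[:k] + z_base[k:]
    return phi.inverse(z_mixed)


def iia(model_outputs, algorithm_outputs):
    """Fraction of interventions where model and algorithm outputs agree."""
    matches = sum(m == a for m, a in zip(model_outputs, algorithm_outputs))
    return matches / len(model_outputs)
```

In a full analysis the intervened hidden state would be fed through the rest of the network, and `iia` would compare the resulting predictions with the algorithm's counterfactual outputs. The number of blocks and their width play the role of the complexity knob (L_rn, d_rn) discussed above.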

Peer brief — for seminar discussion

Sutter et al. (NeurIPS 2025) ask whether causal abstraction, as formalised by Geiger et al. (2024b) and operationalised through distributed alignment search (DAS), is sufficient for mechanistic interpretability when no constraint is imposed on the alignment map ϕ that mediates between a DNN's hidden states and an algorithm's nodes. The paper proceeds in two registers: a formal proof and an empirical corroboration. Theorem 1 establishes that, under five assumptions—countable input-space, layer-wise input-injectivity, strict output-surjectivity, matchable partial-orderings between algorithm and DNN, and task-solving by the DNN—any algorithm A is an input-restricted distributed abstraction of any network N. The proof is constructive but existence-only: it exploits the uncountability of the real-valued hidden state space against the countability of input-restricted interventions to build alignment maps that can encode arbitrary target outputs, representing a form of extreme overfitting with no generalization guarantee. The empirical work validates that such maps are practically learnable. Using RevNet-based non-linear alignment maps (ϕ_nonlin, varied from L_rn=1 to L_rn=8 with d_rn up to 64) applied via DAS to the Pythia suite (31M–410M parameters) on the indirect object identification task, near-perfect interchange intervention accuracy (IIA) is achieved at random initialisation across all model sizes—including 31M and 70M models that never learn the task after training. On the hierarchical equality task using a 3-layer MLP with hidden size 16, randomly initialised models exceed 80% IIA with the most complex ϕ_nonlin, and the layer-dependent IIA degradation patterns observed with linear maps (ϕ_lin) vanish entirely under non-linear maps. The paper terms this the non-linear representation dilemma: relaxing the linearity constraint that implicitly underlies DAS destroys the signal that made DAS informative. 
The paper argues this implies causal abstraction cannot stand alone as a mechanistic interpretability method: it is only meaningful when paired with explicit, justified assumptions about representation encoding (the privileged bases, linear, or non-linear representation hypotheses). The paper introduces ϕ_nonlin (RevNet alignment maps) as its primary diagnostic instrument; an alternative it could have employed is HyperDAS (Sun et al., 2025), which automates node-subspace search via hypernetworks and might expose similar triviality from a different angle. A critical reader would push back on the following: the existence proof in Theorem 1 relies on input-injectivity across all layers, which the authors themselves acknowledge is violated in practice by phenomena like neural collapse and the softmax bottleneck, and which they only verify empirically via collision-counting on 1,280,000 samples for a single small MLP. The gap between 'almost surely injective at initialisation' (Theorem 2) and 'injective after training' is substantial, since neural collapse is precisely a training-induced failure of injectivity, so the theorem's assumptions may not hold for the very trained models on which the empirical results are most consequential. The paper also hypothesises that the ability of linear alignment maps to track the DNN's actual learning trajectory is evidence for the linear representation hypothesis, but explicitly acknowledges it cannot formalise this intuition, leaving the diagnostic value of ϕ_lin vs. ϕ_nonlin comparisons as an open methodological question.
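
The collision-counting check mentioned above can be sketched as follows. This is a hypothetical reconstruction of the general idea, not the authors' procedure: `count_collisions` and the rounding tolerance `decimals` are assumptions introduced for illustration, and `layer` is any function from an input vector to a hidden-state vector.

```python
import math
import random


def count_collisions(layer, inputs, decimals=12):
    """Count inputs whose hidden state coincides with that of an earlier, different input.

    Zero collisions on a sample is only evidence for input-injectivity,
    never a proof of it.
    """
    seen = {}  # rounded hidden state -> first input that produced it
    collisions = 0
    for x in inputs:
        key = tuple(round(v, decimals) for v in layer(x))
        if key in seen and seen[key] != tuple(x):
            collisions += 1
        else:
            seen[key] = tuple(x)
    return collisions


def random_tanh_layer(d_in, d_out, seed=0):
    """Untrained layer with Gaussian weights, standing in for an MLP layer at initialisation."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
    return lambda x: [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(1000)]
    print(count_collisions(random_tanh_layer(4, 4), sample))
```

A layer that has collapsed all inputs to one point, which is the extreme case of the neural-collapse concern raised above, would make every input after the first count as a collision, so the same check run on trained checkpoints would expose the gap between Theorem 2 and the trained models.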

Original abstract

The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed to alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.
