Julian Minder

Co-author; implemented and ran language model experiments and refined proofs

Authored papers (1)

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (2025)
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be mapped perfectly onto any algorithm. Theorem 1 proves this formally under five mild assumptions (countable input space, layer-wise input-injectivity, strict output-surjectivity, matchable partial orderings, and task-solving). The paper demonstrates the triviality empirically using distributed alignment search (DAS) with reversible residual network (RevNet) alignment maps (ϕ_nonlin) on Pythia suite models ranging from 31M to 410M parameters. Near-perfect interchange intervention accuracy (IIA) is achieved on the indirect object identification (IOI) task even for randomly initialised models, and IIA exceeds 80% on randomly initialised 3-layer MLPs in the hierarchical equality task. By contrast, linear alignment maps (ϕ_lin) track the model's actual learning trajectory and exhibit layer-dependent degradation, such as the IIA collapse in layer 3 of the hierarchical equality MLP, that vanishes entirely under ϕ_nonlin. This empirical asymmetry is the crux of what the paper terms the non-linear representation dilemma: lifting the linearity constraint that implicitly underlies DAS and related methods removes the principled basis for distinguishing genuine algorithmic implementation from spurious alignment. The implication is that causal abstraction alone is not sufficient for mechanistic interpretability and must be coupled with explicit, justified assumptions about how features are encoded in neural network representations.
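To make the contrast between linear and non-linear alignment maps concrete, the following is a minimal sketch of a single interchange intervention through an invertible map. It is not the paper's implementation: the class names `LinearMap` and `AdditiveCoupling`, the scalar weights, and the helpers `interchange` and `iia` are all hypothetical, and the maps here are fixed rather than learned as they would be in DAS. The linear map is a simple rotation, and the non-linear map is one RevNet-style additive coupling step; both are exactly invertible, which is what lets the intervened coordinates be mapped back into the hidden space.

```python
import math


def rotate(h, theta):
    """Rotate each consecutive coordinate pair of h by theta (assumes even length)."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for i in range(0, len(h), 2):
        a, b = h[i], h[i + 1]
        out += [c * a - s * b, s * a + c * b]
    return out


class LinearMap:
    """Hypothetical stand-in for a linear alignment map: an orthogonal rotation."""

    def __init__(self, theta):
        self.theta = theta

    def forward(self, h):
        return rotate(h, self.theta)

    def inverse(self, z):
        return rotate(z, -self.theta)


class AdditiveCoupling:
    """Hypothetical stand-in for a non-linear map: one RevNet-style coupling step.

    y1 = x1 + tanh(wf * x2), y2 = x2 + tanh(wg * y1); invertible by construction.
    """

    def __init__(self, wf, wg):
        self.wf, self.wg = wf, wg  # illustrative scalar weights, not learned

    def forward(self, h):
        half = len(h) // 2
        x1, x2 = h[:half], h[half:]
        y1 = [a + math.tanh(self.wf * b) for a, b in zip(x1, x2)]
        y2 = [b + math.tanh(self.wg * a) for a, b in zip(y1, x2)]
        return y1 + y2

    def inverse(self, z):
        half = len(z) // 2
        y1, y2 = z[:half], z[half:]
        x2 = [b - math.tanh(self.wg * a) for a, b in zip(y1, y2)]
        x1 = [a - math.tanh(self.wf * b) for a, b in zip(y1, x2)]
        return x1 + x2


def interchange(phi, h_base, h_source, k):
    """Replace the first k aligned coordinates of the base with the source's."""
    z_base = phi.forward(h_base)
    z_source = phi.forward(h_source)
    mixed = z_source[:k] + z_base[k:]
    return phi.inverse(mixed)


def iia(predictions, counterfactual_labels):
    """Fraction of intervened predictions that match the counterfactual labels."""
    hits = sum(p == y for p, y in zip(predictions, counterfactual_labels))
    return hits / len(predictions)
```

In a full experiment, the intervened hidden state returned by `interchange` would be fed through the remaining layers of the network, and `iia` would compare the resulting predictions with the outputs the high-level algorithm gives under the same intervention. The more expressive the family that `phi` is drawn from, the easier it is to find a map that scores well, which is the concern the summary above describes.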

