thinker:julian-minder
Julian Minder
Co-author; implemented and ran language model experiments and refined proofs
Authored papers (1)
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm. Theorem 1 proves this formally under five mild assumptions (countable input-space, layer-wise input-injectivity, strict output-surjectivity, matchable partial-orderings, and task-solving).

The triviality is demonstrated empirically using distributed alignment search (DAS) with reversible residual network (RevNet) alignment maps (ϕ_nonlin) on Pythia suite models ranging from 31M to 410M parameters. Near-perfect interchange intervention accuracy (IIA) is achieved even for randomly initialised models on the indirect object identification (IOI) task, and IIA exceeds 80% on randomly initialised 3-layer MLPs in the hierarchical equality task.

By contrast, linear alignment maps (ϕ_lin) track the model's actual learning trajectory and exhibit layer-dependent degradation patterns, such as the IIA collapse in layer 3 of the hierarchical equality MLP, that vanish entirely under ϕ_nonlin. This empirical asymmetry is the crux of what the paper terms the non-linear representation dilemma: lifting the linearity constraint that implicitly underlies DAS and related methods removes the principled basis for distinguishing genuine algorithmic implementation from spurious alignment. The implication is that causal abstraction alone is not sufficient for mechanistic interpretability and must be coupled with explicit, justified assumptions about how features are encoded in neural network representations.
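To make the role of the alignment map concrete, here is a minimal sketch of a single interchange intervention through an invertible non-linear map. It assumes a toy additive-coupling layer in the RevNet style with random, untrained weights. The names `make_subnet`, `AdditiveCoupling`, and `interchange`, the 4-dimensional hidden vectors, and the choice of `k` are all hypothetical illustrations and are not taken from the paper; the sketch does no training and computes no IIA.

```python
import math
import random


def make_subnet(dim, rng):
    """Random one-layer tanh map R^dim -> R^dim (hypothetical stand-in for a RevNet sub-network)."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def subnet(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return subnet


class AdditiveCoupling:
    """Invertible non-linear map: y1 = x1 + F(x2), y2 = x2 + G(y1)."""

    def __init__(self, dim, seed=0):
        assert dim % 2 == 0, "dim must be even so the vector splits into two halves"
        rng = random.Random(seed)
        self.half = dim // 2
        self.F = make_subnet(self.half, rng)
        self.G = make_subnet(self.half, rng)

    def forward(self, x):
        x1, x2 = x[:self.half], x[self.half:]
        y1 = [a + b for a, b in zip(x1, self.F(x2))]
        y2 = [a + b for a, b in zip(x2, self.G(y1))]
        return y1 + y2

    def inverse(self, y):
        # Undo the two additive steps in reverse order.
        y1, y2 = y[:self.half], y[self.half:]
        x2 = [a - b for a, b in zip(y2, self.G(y1))]
        x1 = [a - b for a, b in zip(y1, self.F(x2))]
        return x1 + x2


def interchange(phi, h_base, h_source, k):
    """Swap the first k aligned coordinates of the base for the source's, then map back."""
    z_base = phi.forward(h_base)
    z_source = phi.forward(h_source)
    z_mixed = z_source[:k] + z_base[k:]
    return phi.inverse(z_mixed)


# Illustrative hidden states (made-up values, not model activations).
phi = AdditiveCoupling(dim=4, seed=0)
h_base = [0.5, -1.0, 2.0, 0.1]
h_source = [1.5, 0.3, -0.7, 0.9]
h_patched = interchange(phi, h_base, h_source, k=2)
```

In a DAS-style setup the patched hidden state would be fed through the rest of the network and the output compared with what the high-level algorithm predicts under the same swap. A linear ϕ restricts which subspaces can be swapped, whereas an expressive invertible ϕ such as the one sketched here can reshape the representation before the swap, which is the freedom the summary above identifies as the source of the triviality.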
Affiliations (1)
- EPFL (institute)
Co-authors (7)
- Hofmann, Thomas (2 shared)
- Minder, Julian (2 shared)
- Pimentel, Tiago (2 shared)
- Sutter, Denis (2 shared)
- Denis Sutter (1 shared)
- Thomas Hofmann (1 shared)
- Tiago Pimentel (1 shared)
Their work is cited by (2)
Recent mentions (1)
- papers-typeddenis-2025-linear-representation.md