finding

active

finding:non-linear-alignment-map-nonlin-achieves-near-optimal-iia-across-all-layers-on-hierarchical-equality-task-eliminating-layer-dependent-degradation-seen-with-linear-maps

Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps

Key empirical result: non-linear maps overcome linear maps' failure in deeper layers

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Claims (1)

claim

Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information
supports
Central thesis of the paper

Findings (1)

finding

Linear alignment map ϕ_lin shows substantial IIA decrease in third layer for both equality relations and left equality relation algorithms in hierarchical equality task
contradicts
Replicates Geiger et al. 2024b pattern of layer-dependent IIA degradation with linear maps

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Non-Linear Alignment Map (ϕ_nonlin) · method · 0.866
Alignment map implemented as a reversible residual network (RevNet); assumes non-linear representation hypothesis
Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.855
Demonstrates that high IIA can be obtained even when the model cannot solve the task
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.836
Corroborating result on additional task confirming main paper findings
Linear Alignment Map (ϕ_lin) · method · 0.808
Alignment map ϕ(h) = W_orth h, where W_orth is an orthogonal matrix; assumes linear representation hypothesis
Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.798
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.791
Authors connect their finding to the prior probing literature debate
Best localist alignment achieves IIA of 0.73 on hierarchical equality Both Equality Relations in Layer 1 · finding · 0.787
Shows localist alignment fails to capture the distributed structure found by DAS
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.776
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
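
Illustrative sketch (not from the source paper): the two method entries above describe ϕ_lin as an orthogonal matrix applied to the hidden state and ϕ_nonlin as a reversible residual network. The Python below is a minimal, standard-library-only sketch of why both kinds of map are invertible. The Householder construction of W_orth, the residual functions F and G, and all function names are assumptions chosen for illustration; the paper's actual architecture and training procedure are not reproduced here.

```python
import math


def householder(v):
    """Build the orthogonal matrix I - 2 v v^T / (v^T v) as a list of rows."""
    n = len(v)
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def phi_lin(W, h):
    """Linear alignment map: phi(h) = W h, with W orthogonal."""
    return matvec(W, h)


def phi_lin_inv(W, z):
    """Inverse of phi_lin: for an orthogonal W, the inverse is W^T."""
    W_t = [list(col) for col in zip(*W)]
    return matvec(W_t, z)


# Hypothetical residual functions; a RevNet block would use learned networks here.
def F(x):
    return [math.tanh(2.0 * xi) for xi in x]


def G(x):
    return [math.sin(xi) for xi in x]


def phi_nonlin(h):
    """One additive-coupling (RevNet-style) block; len(h) must be even."""
    half = len(h) // 2
    x1, x2 = h[:half], h[half:]
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1 + y2


def phi_nonlin_inv(z):
    """Exact inverse of phi_nonlin, obtained by undoing the two additions in reverse order."""
    half = len(z) // 2
    y1, y2 = z[:half], z[half:]
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1 + x2


if __name__ == "__main__":
    h = [0.5, -1.0, 2.0, 0.25]                 # toy hidden state
    W = householder([1.0, 2.0, 3.0, 4.0])      # toy orthogonal matrix
    lin_err = max(abs(a - b) for a, b in zip(phi_lin_inv(W, phi_lin(W, h)), h))
    non_err = max(abs(a - b) for a, b in zip(phi_nonlin_inv(phi_nonlin(h)), h))
    print("linear round-trip error:", lin_err)
    print("non-linear round-trip error:", non_err)
```

Both maps round-trip a hidden state up to floating-point error, but only the linear one preserves vector norms; the coupling block can reshape the representation space arbitrarily depending on F and G, which is the extra expressive power the findings above attribute to ϕ_nonlin.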