finding

active

finding:over-80-iia-achieved-using-complex-non-linear-alignment-maps-on-randomly-initialised-mlps-in-hierarchical-equality-task

Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task

Demonstrates that high interchange intervention accuracy (IIA) can be obtained even when the model cannot solve the task
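For readers unfamiliar with the metric, IIA is the fraction of interchange interventions for which the network's output after the intervention matches the output the hypothesised high-level algorithm gives under the corresponding intervention. The following is a minimal sketch of that ratio only. The function name and the toy label lists are hypothetical, not taken from the paper, and the sketch does not cover how the alignment map or the interventions themselves are computed.

```python
from typing import Sequence


def interchange_intervention_accuracy(
    intervened_outputs: Sequence[int],
    counterfactual_labels: Sequence[int],
) -> float:
    """Fraction of interventions where the network agrees with the algorithm.

    intervened_outputs: network predictions after each interchange intervention.
    counterfactual_labels: outputs of the high-level algorithm under the
        matching intervention on its own variables.
    """
    if len(intervened_outputs) != len(counterfactual_labels):
        raise ValueError("both sequences must have the same length")
    if len(intervened_outputs) == 0:
        raise ValueError("need at least one intervention")
    matches = sum(
        int(out == label)
        for out, label in zip(intervened_outputs, counterfactual_labels)
    )
    return matches / len(intervened_outputs)


# Hypothetical toy data (not results from the paper): four interventions,
# three of which agree with the high-level algorithm.
toy_network_outputs = [1, 0, 1, 1]
toy_algorithm_outputs = [1, 0, 1, 0]
iia = interchange_intervention_accuracy(toy_network_outputs, toy_algorithm_outputs)
print(f"IIA = {iia:.0%}")
```

Because the ratio only compares post-intervention outputs, it says nothing by itself about whether the network solves the original task, which is the gap this finding highlights.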

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Claims (1)

claim

Near-perfect IIA can be achieved on randomly initialised models that cannot solve the task, suggesting causal alignment does not require task capability
supports
Empirical support for vacuousness of unrestricted causal abstraction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps · finding · 0.855
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
Linear alignment map ϕ_lin shows substantial IIA decrease in third layer for both equality relations and left equality relation algorithms in hierarchical equality task · finding · 0.844
Replicates Geiger et al. 2024b pattern of layer-dependent IIA degradation with linear maps
The And-Or algorithm may not be a true abstraction of the trained MLP's behaviour since it never achieves high IIA in later layers regardless of alignment map complexity · hypothesis · 0.814
Hypothesis raised in distributive law task analysis
Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.811
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
Identity of first argument algorithm IIA consistently hovers around 50% for all alignment map types on hierarchical equality task · finding · 0.811
Exception to the general trend; attributed to insufficient RevNet capacity rather than to the algorithm not being implemented
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.806
Corroborating result on additional task confirming main paper findings
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.797
Authors connect their finding to the prior probing literature debate
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.796
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation