finding

active

finding:8-layer-nonlin-achieves-near-perfect-iia-on-pythia-410m-at-all-training-steps-including-random-initialisation-on-ioi-task

8-layer ϕ_nonlin achieves near-perfect IIA on Pythia-410m at all training steps, including random initialisation, on the IOI task

Training-progression result showing that IIA under non-linear alignment maps is uncorrelated with genuine task learning

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Claims (2)

claim

Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information
supports
Central thesis of the paper
Near-perfect IIA can be achieved on randomly initialised models that cannot solve the task, suggesting causal alignment does not require task capability
supports
Empirical support for vacuousness of unrestricted causal abstraction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

With only 1,000 training samples, ϕ_nonlin achieves IIA over 0.99 on the training set for the identity-of-first-argument algorithm, but fails at scale · finding · 0.788
Confirms theorem's existence proof holds but practical learnability fails with insufficient RevNet capacity
Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.775
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
Layer 24 (indexed at 8) of LLaMA3.1-8B on Hinting satisfies Criteria 1 and 2 under both IIT 3.0 and IIT 4.0 (temporal permutation). · finding · 0.775
One of the most promising cases; approximately corresponds to the 2/3 layer of LLaMA3.1-8B.
Across 5 Pythia seeds, one seed fails to learn the IOI task and another fails alignment despite learning the task; all other seeds achieve perfect alignment with ϕ_nonlin · finding · 0.767
Robustness check across seeds showing occasional failures of alignment map training
The fact that ϕ_lin tracks DNN performance more closely than ϕ_nonlin throughout training may support the linear representation hypothesis for IOI task features · hypothesis · 0.766
Authors' tentative hypothesis from Fig. 4; they acknowledge that they cannot formalise this intuition
pythia-14m achieves only 0.38 accuracy on npi_ever_subj-relc task · finding · 0.757
Baseline accuracy showing small models fail on harder NPI licensing tasks
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.752
Corroborating result on additional task confirming main paper findings
Several Mixtral-8x7B samples could not be initialized as valid networks using PyPhi under IIT 4.0 and were excluded. · finding · 0.750
Methodological limitation disproportionately affecting the largest MoE model, constraining generalizability.