finding

active

finding:linear-alignment-map-lin-iia-tracks-dnn-accuracy-during-pythia-410m-training-progression-on-ioi-task

Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task

Suggests linear maps may be better correlated with genuine task implementation than non-linear maps

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

The fact that ϕ_lin tracks DNN performance more closely than ϕ_nonlin throughout training may support the linear representation hypothesis for IOI task features
associated_with · supports
Tentative hypothesis the authors draw from Fig. 4; they acknowledge that they cannot formalise this intuition

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear alignment map ϕ_lin shows substantial IIA decrease in third layer for both equality relations and left equality relation algorithms in hierarchical equality task · finding · 0.813
Replicates the pattern of layer-dependent IIA degradation with linear maps reported by Geiger et al. (2024b)
Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.811
Demonstrates that high IIA can be obtained even when the model cannot solve the task
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps · finding · 0.798
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.791
Attributed to model anisotropy from saturation making hidden states harder to access
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.778
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
8-layer ϕ_nonlin achieves near-perfect IIA on Pythia-410m at all training steps including random initialisation on IOI task · finding · 0.775
Training progression result showing non-linear maps are uncorrelated with genuine task learning
Linear Alignment Map (ϕ_lin) · method · 0.773
Alignment map ϕ(h) = W_orth h, where W_orth is an orthogonal matrix; assumes the linear representation hypothesis
Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task · finding · 0.760
Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
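
Illustration for the Linear Alignment Map (ϕ_lin) entry above: a minimal sketch of what ϕ(h) = W_orth h could look like in practice, assuming a small hidden state and a randomly drawn orthogonal matrix. Only the form ϕ(h) = W_orth h comes from the entry. The helper names (random_orthogonal, phi_lin, phi_lin_inverse, interchange), the Gram-Schmidt construction, and the coordinate-swapping step are hypothetical choices made for illustration, and the paper presumably learns W_orth rather than sampling it.

```python
import math
import random


def random_orthogonal(n, seed=0):
    """Hypothetical helper: n x n orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components of v that lie along rows already accepted.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip a (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def matvec(W, h):
    """Matrix-vector product with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def phi_lin(W, h):
    """Linear alignment map: phi(h) = W_orth h."""
    return matvec(W, h)


def phi_lin_inverse(W, z):
    """Because W is orthogonal, its inverse is its transpose."""
    W_t = [list(col) for col in zip(*W)]
    return matvec(W_t, z)


def interchange(W, h_base, h_source, dims):
    """Assumed intervention: copy the chosen rotated coordinates from the
    source run into the base run, then map back to the hidden space."""
    z_base = phi_lin(W, h_base)
    z_source = phi_lin(W, h_source)
    for d in dims:
        z_base[d] = z_source[d]
    return phi_lin_inverse(W, z_base)
```

Because W_orth is orthogonal, the map preserves vector norms and is inverted by its transpose, so an intervention performed in the rotated coordinates can be mapped back to the hidden space without loss. Swapping every coordinate returns the source hidden state, and swapping none returns the base hidden state.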