hypothesis

active

hypothesis:the-fact-that-lin-tracks-dnn-performance-more-closely-than-nonlin-throughout-training-may-support-the-linear-representation-hypothesis-for-ioi-task-features

The fact that ϕ_lin tracks DNN performance more closely than ϕ_nonlin throughout training may support the linear representation hypothesis for IOI task features

Tentative hypothesis the authors draw from Fig. 4; they acknowledge that they cannot formalise this intuition.

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Findings (1)

finding

Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task
associated_with · supports
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
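
The finding above says that the IIA of ϕ_lin "tracks" DNN accuracy over training. As a rough illustration of how such tracking might be quantified, the sketch below compares two per-checkpoint curves by their mean absolute gap and their Pearson correlation. This is not the authors' method (they state that they cannot formalise the intuition). The function names `mean_abs_gap` and `pearson` are hypothetical, and the curves are made-up placeholders, not values from the paper.

```python
import math


def mean_abs_gap(iia, accuracy):
    """Mean absolute difference between two per-checkpoint curves."""
    if len(iia) != len(accuracy) or not iia:
        raise ValueError("curves must be non-empty and equally long")
    return sum(abs(a - b) for a, b in zip(iia, accuracy)) / len(iia)


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either curve is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long curves with >= 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a flat curve carries no trend to correlate
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder curves over four checkpoints (NOT data from the paper).
accuracy = [0.0, 0.2, 0.6, 0.9]      # hypothetical DNN task accuracy
iia_lin = [0.1, 0.3, 0.6, 0.9]       # hypothetical IIA of a linear map
iia_nonlin = [1.0, 1.0, 1.0, 1.0]    # hypothetical IIA of a non-linear map

gap_lin = mean_abs_gap(iia_lin, accuracy)
gap_nonlin = mean_abs_gap(iia_nonlin, accuracy)
corr_lin = pearson(iia_lin, accuracy)
corr_nonlin = pearson(iia_nonlin, accuracy)

print(f"linear:     gap={gap_lin:.3f}  corr={corr_lin:.3f}")
print(f"non-linear: gap={gap_nonlin:.3f}  corr={corr_nonlin}")
```

Under these placeholder numbers, a curve that stays near 1.0 at every checkpoint has a large gap to accuracy and no defined correlation with it, which is the sense in which a map can score high IIA without reflecting task learning.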

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Assuming linear representations enables identifying the location of certain variables in a DNN, but many insights fail to generalise when more powerful non-linear maps are used · claim · 0.802
Interpretive claim about what linear DAS results actually tell us
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.799
Corroborating result on additional task confirming main paper findings
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.784
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
With only 1,000 training samples, ϕ_nonlin achieves IIA over 0.99 on training set for identity of first argument algorithm, but fails at scale · finding · 0.779
Confirms theorem's existence proof holds but practical learnability fails with insufficient RevNet capacity
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps · finding · 0.772
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces. · hypothesis · 0.767
Foundation for interpreting features as linear directions.
8-layer ϕ_nonlin achieves near-perfect IIA on Pythia-410m at all training steps including random initialisation on IOI task · finding · 0.766
Training progression result showing non-linear maps are uncorrelated with genuine task learning
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.752
Explanation for why dictionary learning can recover many more features than dimensions.