finding

active

finding:when-training-and-test-sets-use-completely-disjoint-name-sets-in-ioi-task-alignment-maps-fail-to-generalise-even-with-complex-nonlin-on-randomly-initialised-models

When training and test sets use completely disjoint name sets in the IOI task, alignment maps fail to generalise, even with complex ϕ_nonlin on randomly initialised models

Shows that high IIA on random models depends on entity overlap between training and test data; generalisation is essential for genuine interpretation
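The disjoint-name setup described above can be illustrated with a minimal sketch. It is not the paper's code: the name pool, the prompt template, and the helper names `split_names` and `make_ioi_prompts` are hypothetical and chosen only for illustration. It shows one way to guarantee that no name appearing in training prompts also appears in test prompts.

```python
import random


def split_names(names, test_fraction=0.5, seed=0):
    """Partition a name pool into disjoint train and test sets."""
    unique = sorted(set(names))          # deduplicate, fix an order
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * (1 - test_fraction))
    return unique[:cut], unique[cut:]


def make_ioi_prompts(names, n, seed=0):
    """Build n IOI-style prompts using only the given names.

    The template is a hypothetical stand-in, not the one used in the paper.
    """
    rng = random.Random(seed)
    prompts = []
    for _ in range(n):
        subject, indirect_object = rng.sample(names, 2)  # two distinct names
        text = (
            f"When {subject} and {indirect_object} went to the store, "
            f"{subject} gave a drink to"
        )
        prompts.append(
            {"text": text, "subject": subject, "answer": indirect_object}
        )
    return prompts


# Hypothetical name pool for illustration only.
NAME_POOL = ["Alice", "Bob", "Carol", "David", "Erin", "Frank", "Grace", "Henry"]

train_names, test_names = split_names(NAME_POOL)
train_prompts = make_ioi_prompts(train_names, n=4, seed=1)
test_prompts = make_ioi_prompts(test_names, n=4, seed=2)

# No entity overlap between the two splits.
assert set(train_names).isdisjoint(test_names)
```

An alignment map fitted on `train_prompts` and evaluated on `test_prompts` then cannot score well merely by memorising specific names, which is the confound this finding isolates.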

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Claims (1)

claim

Generalisation of alignment maps to unseen inputs is fundamental to interpreting a model, distinguishing genuine understanding from memorisation
supports
Authors' proposed criterion for meaningful causal abstraction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

What factors determine the generalisation of learned alignment maps beyond training data? · question · 0.800
Open question about the gap between Theorem 1's existence proof and practical learnability
Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.796
Demonstrates that high IIA can be obtained even when the model cannot solve the task
The fact that ϕ_lin tracks DNN performance more closely than ϕ_nonlin throughout training may support the linear representation hypothesis for IOI task features · hypothesis · 0.784
Authors' tentative hypothesis from Fig. 4; they acknowledge that they cannot formalise this intuition
Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.778
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps · finding · 0.776
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
If simulators are not inner aligned, then many important properties like prediction orthogonality may not hold. · hypothesis · 0.763
Conditional importance of inner alignment.
Across 5 Pythia seeds, one seed fails to learn IOI task and another fails alignment despite learning the task; all other seeds achieve perfect alignment with ϕ_nonlin · finding · 0.763
Robustness check across seeds showing occasional failures of alignment map training
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.763
Authors' interpretation of prompt-variation results showing that alignment faking disappears only when the conflicting objective is removed