finding
active
finding:when-training-and-test-sets-use-completely-disjoint-name-sets-in-ioi-task-alignment-maps-fail-to-generalise-even-with-complex-nonlin-on-randomly-initialised-models
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models
Shows that high IIA on randomly initialised models depends on entity (name) overlap between the training and test sets; generalisation to unseen entities is essential for a genuine interpretation
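The disjoint-name setup described above can be illustrated with a minimal sketch. It is not the authors' code: the name list, the prompt template, and the helpers `disjoint_name_split` and `make_ioi_prompt` are hypothetical, chosen only to show what "completely disjoint name sets" means for an IOI-style dataset.

```python
import random


def disjoint_name_split(names, train_frac=0.5, seed=0):
    """Split a list of names into two pools that share no name (hypothetical helper)."""
    rng = random.Random(seed)
    pool = sorted(set(names))  # deduplicate and fix the order before shuffling
    rng.shuffle(pool)
    cut = int(len(pool) * train_frac)
    return pool[:cut], pool[cut:]


def make_ioi_prompt(rng, pool):
    """Build one IOI-style prompt from a name pool; the template is illustrative only."""
    subject, indirect_object = rng.sample(pool, 2)
    prompt = (
        f"When {subject} and {indirect_object} went to the store, "
        f"{subject} gave a drink to"
    )
    return prompt, indirect_object  # the expected completion is the indirect object


# Assumed toy name list; a real experiment would use a much larger one.
names = ["Alice", "Bob", "Carol", "Dan", "Erin", "Frank", "Grace", "Heidi"]
train_names, test_names = disjoint_name_split(names)

rng = random.Random(1)
train_set = [make_ioi_prompt(rng, train_names) for _ in range(4)]
test_set = [make_ioi_prompt(rng, test_names) for _ in range(4)]
```

Under this split, an alignment map fitted on `train_set` never sees any name that appears in `test_set`, so a high test IIA cannot come from memorising specific entities.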
Source paper
extracted_from · (2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors' proposed criterion for meaningful causal abstraction
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- What factors determine the generalisation of learned alignment maps beyond training data? · question · 0.800 · Open question about the gap between Theorem 1's existence proof and practical learnability
- Demonstrates that high IIA can be obtained even when model cannot solve the task
- Authors' tentative hypothesis from Fig. 4 but they acknowledge they cannot formalise this intuition
- Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.778 · Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
- Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
- If simulators are not inner aligned, then many important properties like prediction orthogonality may not hold. · hypothesis · 0.763 · Conditional importance of inner alignment.
- Robustness check across seeds showing occasional failures of alignment map training
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed