finding

active

finding:across-5-pythia-seeds-one-seed-fails-to-learn-ioi-task-and-another-fails-alignment-despite-learning-the-task-all-other-seeds-achieve-perfect-alignment-with-nonlin

Across 5 Pythia seeds, one seed fails to learn the IOI task and another fails alignment despite learning the task; all other seeds achieve perfect alignment with ϕ_nonlin

Robustness check across seeds showing occasional failures of alignment map training

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

8-layer ϕ_nonlin achieves near-perfect IIA on Pythia-410m at all training steps including random initialisation on IOI task · finding · 0.767
Training progression result showing non-linear maps are uncorrelated with genuine task learning
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.763
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.761
Attributed to model anisotropy from saturation making hidden states harder to access
NPI mechanism in pythia-1b moves negation feature through complementiser 'that', auxiliary verb, and main verb across layers before predicting NPI 'any' · finding · 0.756
Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism
Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.745
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
NPI licensing mechanism in pythia-1b emerges in discrete stages (steps 1000, 2000, 3000) not gradually · finding · 0.745
Training dynamics finding showing abrupt rather than gradual emergence of NPI mechanism
pythia-14m achieves only 0.38 accuracy on npi_ever_subj-relc task · finding · 0.735
Baseline accuracy showing small models fail on harder NPI licensing tasks
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity · finding · 0.730
Shows the passive vs. active divide is more important than the specific wording of instructions.
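
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration only: the function names, the dictionary-of-vectors layout, and the representation of typed edges as ID pairs are all assumptions, not the actual implementation behind this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(query_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above `threshold`, highest first.

    Hypothetical inputs:
      embeddings:  dict mapping entity ID -> embedding vector (list of floats)
      typed_edges: set of (id_a, id_b) pairs that already share a typed relation
    """
    query = embeddings[query_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == query_id:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (query_id, other_id) in typed_edges or (other_id, query_id) in typed_edges:
            continue
        score = cosine(query, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity that already has a typed edge to the query (such as the source paper via extracted_from) would be excluded even if its similarity score were high.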