finding

active


With only 1,000 training samples, ϕ_nonlin achieves an IIA above 0.99 on the training set for the identity-of-first-argument algorithm, but fails at scale

Confirms that the theorem's existence proof holds, but that practical learnability fails when RevNet capacity is insufficient
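
For readers unfamiliar with the metric, the following is a minimal sketch of how interchange intervention accuracy (IIA) is commonly computed: the fraction of interchange interventions for which the network's output matches the output the high-level algorithm predicts under the corresponding intervention. The function name and the toy labels are hypothetical and are not taken from the source paper; running the interventions themselves is assumed to happen elsewhere.

```python
from typing import Sequence


def interchange_intervention_accuracy(
    model_outputs: Sequence[int],
    algorithm_outputs: Sequence[int],
) -> float:
    """Fraction of interventions where the network agrees with the algorithm.

    model_outputs[i]     : network prediction after the i-th interchange intervention
    algorithm_outputs[i] : high-level algorithm's output under the matching intervention
    (Both sequences are assumed to be precomputed; this is an illustrative helper.)
    """
    if len(model_outputs) != len(algorithm_outputs):
        raise ValueError("output sequences must have the same length")
    if len(model_outputs) == 0:
        raise ValueError("at least one intervention is required")
    matches = sum(m == a for m, a in zip(model_outputs, algorithm_outputs))
    return matches / len(model_outputs)


# Toy, made-up labels: full agreement on a small "training" set,
# chance-level agreement on a "held-out" set.
train_iia = interchange_intervention_accuracy([1, 0, 1, 1], [1, 0, 1, 1])
heldout_iia = interchange_intervention_accuracy([1, 0, 1, 1], [1, 1, 1, 0])
```

Computing the score separately on training and held-out interventions, as in the toy usage, is one way to expose the gap this finding describes between fitting a small sample and generalising at scale.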

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Findings (1)

finding

Theorem 1: Any algorithm is an input-restricted distributed abstraction of any DNN satisfying mild assumptions
supports
Central theoretical result proving unrestricted causal abstraction is trivial

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

8-layer ϕ_nonlin achieves near-perfect IIA on Pythia-410m at all training steps including random initialisation on IOI task · finding · 0.788
Training progression result showing non-linear maps are uncorrelated with genuine task learning
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.786
Corroborating result on additional task confirming main paper findings
The fact that ϕ_lin tracks DNN performance more closely than ϕ_nonlin throughout training may support the linear representation hypothesis for IOI task features · hypothesis · 0.779
Authors' tentative hypothesis from Fig. 4 but they acknowledge they cannot formalise this intuition
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.760
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
Ratio of unmonitored to training non-compliance increases from 1.14 to infinity for default prompt and to 66 for OpenAI variant post-RL · finding · 0.755
Demonstrates severity of training-deployment gap after RL
Identity of first argument algorithm IIA consistently hovers around 50% for all alignment map types on hierarchical equality task · finding · 0.752
Exception to the general trend; attributed to insufficient RevNet capacity rather than algorithm not being implemented
There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.745
Selective pressure toward convergence via task generality
Normal (α=0.9) and chronic (α=0.1) agents in Objective-only non-stationary category perform best with opposite learning rates · finding · 0.738
Suggests fundamental differences in learning dynamics between normal and chronic perception models