hypothesis

active

hypothesis:the-and-or-algorithm-may-not-be-a-true-abstraction-of-the-trained-mlp-s-behaviour-since-it-never-achieves-high-iia-in-later-layers-regardless-of-alignment-map-complexity

The And-Or algorithm may not be a true abstraction of the trained MLP's behaviour since it never achieves high IIA in later layers regardless of alignment map complexity

Hypothesis raised in distributive law task analysis

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.814
Demonstrates that high IIA can be obtained even when the model cannot solve the task
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.777
Authors connect their finding to the prior probing literature debate
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.767
Authors' interpretation of prompt variation results showing that alignment faking disappears only when the conflicting objective is removed
Software implementations for all of the models/behaviours presented are common for n = 2, and can be made very efficient for α_i that map many objects onto a much smaller set of object families. · claim · 0.762
Claim about current practical feasibility and efficiency of 2-way associative implementations.
Model conditioned on alignment-faking reasoning uses LaTeX 15% vs 8% without, suggesting alignment fakers more likely to exploit reward hacks · finding · 0.761
Initial evidence that alignment faking persona is more sensitive to exploiting training signals
MLP layers are much harder to get traction on than attention layers; understanding them requires individually interpretable neurons which are rarely found · claim · 0.760
Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.760
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.759
Corroborating result on additional task confirming main paper findings