finding

active

finding:linear-alignment-map-lin-shows-substantial-iia-decrease-in-third-layer-for-both-equality-relations-and-left-equality-relation-algorithms-in-hierarchical-equality-task

Linear alignment map ϕ_lin shows a substantial IIA decrease in the third layer for the Both Equality Relations and Left Equality Relation algorithms on the hierarchical equality task

Replicates the pattern of layer-dependent IIA degradation with linear maps reported by Geiger et al. (2024b)

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Papers (1)

paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
associated_with

Findings (1)

finding

Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps
contradicts
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.844
Demonstrates that high IIA can be obtained even when model cannot solve the task
Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.813
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
Identity of first argument algorithm IIA consistently hovers around 50% for all alignment map types on hierarchical equality task · finding · 0.807
Exception to the general trend; attributed to insufficient RevNet capacity rather than algorithm not being implemented
Best localist alignment achieves IIA of 0.73 on hierarchical equality Both Equality Relations in Layer 1 · finding · 0.804
Shows localist alignment fails to capture the distributed structure found by DAS.
Linear Alignment Map (ϕ_lin) · method · 0.803
Alignment map ϕ(h) = W_orth h with an orthogonal matrix W_orth; assumes the linear representation hypothesis
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.787
Authors connect their finding to the prior probing literature debate
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.773
Corroborating result on additional task confirming main paper findings
Brute-force search achieves best IIA of 0.60 on hierarchical equality Both Equality Relations in Layer 1 · finding · 0.766
DAS substantially outperforms brute-force search (1.00 vs 0.60 IIA) on the hierarchical equality task.
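
The linear alignment map listed above, ϕ(h) = W_orth h, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`random_orthogonal`, `phi_lin`, `interchange`), the choice of the first `k` rotated coordinates as the aligned subspace, and the use of a random rather than a learned orthogonal matrix are all assumptions made for illustration. The sketch rotates a base and a source hidden state, swaps the aligned coordinates, and maps the result back with the transpose, which is the inverse of an orthogonal matrix.

```python
import numpy as np


def random_orthogonal(dim, seed=0):
    """Return a random orthogonal matrix (hypothetical stand-in for a learned W_orth)."""
    rng = np.random.default_rng(seed)
    # QR decomposition of a square Gaussian matrix yields an orthogonal factor q.
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def phi_lin(w_orth, h):
    """Linear alignment map: rotate the hidden state h."""
    return w_orth @ h


def phi_lin_inverse(w_orth, z):
    """Inverse map: for an orthogonal matrix the inverse is the transpose."""
    return w_orth.T @ z


def interchange(w_orth, h_base, h_source, k):
    """Swap the first k rotated coordinates of the base state with the source's.

    Assumption: the aligned high-level variable occupies the first k
    coordinates of the rotated space.
    """
    z_base = phi_lin(w_orth, h_base)
    z_source = phi_lin(w_orth, h_source)
    z_new = z_base.copy()
    z_new[:k] = z_source[:k]
    return phi_lin_inverse(w_orth, z_new)


# Example usage with made-up hidden states of dimension 8.
rng = np.random.default_rng(1)
W = random_orthogonal(8)
h_base = rng.standard_normal(8)
h_source = rng.standard_normal(8)
h_intervened = interchange(W, h_base, h_source, k=3)
```

In this setting, IIA is the fraction of such interventions for which the network's output, computed from the intervened hidden state, matches the output the high-level algorithm gives under the corresponding intervention; that evaluation loop is omitted here because it depends on the model and task.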