finding

active

finding:identity-of-first-argument-algorithm-iia-consistently-hovers-around-50-for-all-alignment-map-types-on-hierarchical-equality-task

Identity of first argument algorithm IIA consistently hovers around 50% for all alignment map types on hierarchical equality task

This is an exception to the general trend. It is attributed to insufficient RevNet capacity rather than to the algorithm not being implemented.

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.811
Demonstrates that high IIA can be obtained even when the model cannot solve the task.
Linear alignment map ϕ_lin shows substantial IIA decrease in third layer for both equality relations and left equality relation algorithms in hierarchical equality task · finding · 0.807
Replicates Geiger et al. 2024b pattern of layer-dependent IIA degradation with linear maps
Best localist alignment achieves IIA of 0.73 on hierarchical equality Both Equality Relations in Layer 1 · finding · 0.807
Shows localist alignment fails to capture the distributed structure found by DAS.
Brute-force search achieves best IIA of 0.60 on hierarchical equality Both Equality Relations in Layer 1 · finding · 0.774
DAS substantially outperforms brute-force search (1.00 vs 0.60 IIA) on the hierarchical equality task.
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps · finding · 0.761
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.756
Authors connect their finding to the prior probing literature debate
DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1 · finding · 0.754
DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
With only 1,000 training samples, ϕ_nonlin achieves IIA over 0.99 on training set for identity of first argument algorithm, but fails at scale · finding · 0.752
Confirms that the theorem's existence proof holds, but that practical learnability fails with insufficient RevNet capacity.