claim

active

claim:the-effect-of-alignment-map-complexity-on-iia-in-causal-abstraction-is-an-analogue-of-the-probing-complexity-accuracy-trade-off

The effect of alignment map (ϕ) complexity on interchange intervention accuracy (IIA) in causal abstraction is an analogue of the probing complexity–accuracy trade-off

The authors connect this finding to an earlier debate in the probing literature.
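IIA is always measured relative to a chosen alignment map ϕ, which is why the complexity of ϕ matters. The sketch below is a minimal, hypothetical illustration, not the paper's setup. A toy low-level model stores z = (a AND b, c) in a rotated hidden basis h = Qz and outputs (a AND b) OR c. A high-level And-Or-style algorithm has one intermediate variable V = a AND b. For every (base, source) input pair, both hidden states are mapped through ϕ, the coordinate aligned with V is copied from source to base, the result is mapped back with ϕ⁻¹, and a hit is scored when the model's output equals the algorithm's counterfactual output. The names `hidden`, `readout`, `Q`, and `iia` are assumptions made for illustration.

```python
import itertools

import numpy as np

# Hypothetical toy model (illustration only): it stores z = (a AND b, c)
# in a 45-degree rotated hidden basis h = Q @ z.
THETA = np.pi / 4
Q = np.array([[np.cos(THETA), -np.sin(THETA)],
              [np.sin(THETA), np.cos(THETA)]])


def hidden(x):
    """Low-level hidden state for input x = (a, b, c), each 0 or 1."""
    a, b, c = x
    return Q @ np.array([float(a and b), float(c)])


def readout(h):
    """Low-level output: undo the rotation and compute (a AND b) OR c."""
    return int((Q.T @ h).sum() > 0.5)


def high_level_counterfactual(base, source):
    """And-Or-style algorithm with V := V(source): output V(source) OR c(base)."""
    v_source = source[0] and source[1]
    return int(bool(v_source or base[2]))


def iia(phi, phi_inv, dims=(0,)):
    """Interchange intervention accuracy of alignment map phi over all input pairs."""
    inputs = list(itertools.product([0, 1], repeat=3))
    idx = list(dims)
    hits = 0
    for base, source in itertools.product(inputs, repeat=2):
        z = phi(hidden(base)).copy()
        z[idx] = phi(hidden(source))[idx]  # swap the coordinate(s) aligned with V
        hits += readout(phi_inv(z)) == high_level_counterfactual(base, source)
    return hits / len(inputs) ** 2


print(iia(lambda h: Q.T @ h, lambda z: Q @ z))  # correct linear map: 1.0
print(iia(lambda h: h, lambda z: z))  # misaligned identity map: 0.8125
```

With the correct linear map ϕ = Qᵀ every intervention matches the algorithm; with the identity map, swapping raw coordinate 0 leaves the base output unchanged, so 12 of the 64 pairs fail. Allowing ϕ to range over a richer family makes it easier to find some map that scores highly, which is the concern this claim links to probe complexity.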

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Concepts (1)

concept

Probing Complexity–Accuracy Trade-off
extends
A long-standing debate in the probing literature about whether complex probes reveal information genuinely encoded in a representation or merely memorise the task; this paper revives it for causal abstraction
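The memorisation side of this debate can be shown with a small control-style experiment: labels are drawn at random, independently of the "representations", so any accuracy above chance on held-out data would be spurious. This is a minimal sketch under stated assumptions: Gaussian vectors stand in for model representations, a least-squares linear classifier stands in for a low-complexity probe, and a 1-nearest-neighbour lookup stands in for a maximally expressive probe. All function names are hypothetical.

```python
import numpy as np


def with_bias(X):
    # Append a constant column so the linear probe has an intercept
    return np.hstack([X, np.ones((len(X), 1))])


def linear_probe_accuracy(X_tr, y_tr, X_ev, y_ev):
    # Low-complexity probe: least-squares fit to labels recoded as -1/+1
    w, *_ = np.linalg.lstsq(with_bias(X_tr), 2.0 * y_tr - 1.0, rcond=None)
    preds = (with_bias(X_ev) @ w > 0).astype(int)
    return float(np.mean(preds == y_ev))


def lookup_probe_accuracy(X_tr, y_tr, X_ev, y_ev):
    # High-complexity probe: 1-nearest-neighbour, i.e. it memorises the training set
    d2 = ((X_ev[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    preds = y_tr[np.argmin(d2, axis=1)]
    return float(np.mean(preds == y_ev))


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))  # stand-in "representations"
y = rng.integers(0, 2, size=400)  # control labels, independent of X
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

for name, probe in [("linear", linear_probe_accuracy), ("lookup", lookup_probe_accuracy)]:
    train_acc = probe(X_tr, y_tr, X_tr, y_tr)
    test_acc = probe(X_tr, y_tr, X_te, y_te)
    print(f"{name}: train={train_acc:.2f} held-out={test_acc:.2f}")
```

The lookup probe fits the random labels perfectly on the training set while its held-out accuracy stays near chance, and the linear probe stays far from fitting them. High accuracy from an expressive probe therefore says little on its own about what the representation encodes. Exact numbers depend on the seed.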

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Near-perfect IIA can be achieved on randomly initialised models that cannot solve the task, suggesting causal alignment does not require task capability · claim · 0.812
Empirical support for vacuousness of unrestricted causal abstraction
Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in the hierarchical equality task · finding · 0.797
Demonstrates that high IIA can be obtained even when model cannot solve the task
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information · claim · 0.797
Central thesis of the paper
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on the hierarchical equality task, eliminating the layer-dependent degradation seen with linear maps · finding · 0.791
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
Linear alignment map ϕ_lin shows a substantial IIA decrease in the third layer for both the "both equality relations" and "left equality relation" algorithms in the hierarchical equality task · finding · 0.787
Replicates the layer-dependent IIA degradation with linear maps reported by Geiger et al. (2024b)
An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.785
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.782
Replicates the finding of Wu et al. (2023); validates the concern about DAS expressivity in the CausalGym setup
The And-Or algorithm may not be a true abstraction of the trained MLP's behaviour since it never achieves high IIA in later layers regardless of alignment map complexity · hypothesis · 0.777
Hypothesis raised in distributive law task analysis
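The "Related by similarity" list is defined by two conditions: cosine similarity of at least 0.65 and no typed edge to this claim. A minimal sketch of that selection rule, assuming each entity has an embedding vector (the embedding model is not specified here) and that typed neighbours are known by ID; the function name `similar_untyped_neighbours` and the toy vectors are hypothetical.

```python
import numpy as np


def similar_untyped_neighbours(query, candidates, typed_ids, threshold=0.65):
    """Hypothetical rule: keep entities with cosine >= threshold and no typed edge."""
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    hits = []
    for entity_id, vec in candidates.items():
        if entity_id in typed_ids:
            continue  # already linked by a typed relation such as "extends"
        v = np.asarray(vec, dtype=float)
        cosine = float(q @ (v / np.linalg.norm(v)))
        if cosine >= threshold:
            hits.append((entity_id, cosine))
    # Rank by similarity, highest first
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


candidates = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [1.0, 0.1]}
result = similar_untyped_neighbours([1.0, 0.0], candidates, typed_ids={"d"})
print([(eid, round(cos, 3)) for eid, cos in result])  # [('a', 1.0), ('b', 0.707)]
```

Entity "d" is excluded despite its high similarity because it already has a typed edge, and "c" falls below the threshold.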