claim

active

claim:causal-abstraction-is-not-enough-for-mechanistic-interpretability-because-it-becomes-vacuous-without-assumptions-about-how-models-encode-information

Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information

Central thesis of the paper

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Papers (1)

paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
introduces

Findings (4)

finding

Theorem 1: Any algorithm is an input-restricted distributed abstraction of any DNN satisfying mild assumptions
associated_with · supports
Central theoretical result proving unrestricted causal abstraction is trivial
8-layer ϕ_nonlin achieves near-perfect IIA on Pythia-410m at all training steps including random initialisation on IOI task
supports
Training-progression result showing that IIA under non-linear maps does not track genuine task learning
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on hierarchical equality task, eliminating layer-dependent degradation seen with linear maps
supports
Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences
supports
Corroborating result on additional task confirming main paper findings
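
The findings above are stated in terms of interchange intervention accuracy (IIA) under an alignment map ϕ. As a minimal sketch of what that quantity measures, the toy below computes IIA for one hand-built "model" under two alignment maps. Everything here is a hypothetical illustration and not the paper's setup: the algorithm (V = a XOR b, output = V AND c), the two-coordinate hidden state, and the names `hidden`, `readout`, `identity`, and `unmix` are all assumptions made for the example.

```python
from itertools import product

def high_level(a, b, c, v_override=None):
    """Toy algorithm: V = a XOR b, output = V AND c. Optionally intervene on V."""
    v = (a ^ b) if v_override is None else v_override
    return v & c

def hidden(a, b, c):
    """Toy low-level model: coordinate 0 mixes V with c, coordinate 1 stores c."""
    return ((a ^ b) + c, c)

def readout(h):
    """Toy output head; on unpatched states it reproduces V AND c."""
    return (h[0] - h[1]) * h[1]

# Two candidate alignment maps, each given as (phi, phi_inverse).
identity = (lambda h: h, lambda z: z)
unmix = (lambda h: (h[0] - h[1], h[1]), lambda z: (z[0] + z[1], z[1]))

def iia(alignment):
    """Fraction of (base, source) pairs where patching matches the algorithm."""
    phi, phi_inv = alignment
    inputs = list(product((0, 1), repeat=3))
    hits = 0
    for base, source in product(inputs, repeat=2):
        z_base, z_source = phi(hidden(*base)), phi(hidden(*source))
        # Interchange: coordinate 0 (aligned to V) is taken from the source run.
        patched = (z_source[0], z_base[1])
        low = readout(phi_inv(patched))
        # Counterfactual prediction: base input with V set to the source's V.
        high = high_level(*base, v_override=source[0] ^ source[1])
        hits += (low == high)
    return hits / (len(inputs) ** 2)

print("identity map IIA:", iia(identity))
print("unmixing map IIA:", iia(unmix))
```

In this toy the same model scores differently depending only on the choice of ϕ, which is the dependence the findings probe with linear and non-linear maps on real networks.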

Quotes (1)

quote

causal abstraction implicitly relies on strong assumptions about how features are encoded in deep neural networks (DNNs), and becomes trivial without such assumptions
supports
Load-bearing formulation of the paper's central argument

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.847
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
Causal abstraction implicitly relies on strong assumptions about feature encoding in DNNs, and becomes trivial without such assumptions · claim · 0.840
Authors' interpretation connecting their proof to practical interpretability methodology
Early causal abstraction methods (Geiger et al. 2021) implicitly rely on the privileged bases hypothesis, while recent methods (Geiger et al. 2024b) rely on the linear representation hypothesis · claim · 0.816
Historical framing of how representation assumptions have evolved in causal interpretability
What is the connection between information encoding assumptions and causal abstraction? · question · 0.816
Identified as exciting future work direction
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.804
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
What can causal abstraction analyses tell us about how DNNs encode features if the methods themselves rely on encoding assumptions? · question · 0.801
Circular dependency problem raised in discussion
Causal abstraction theory is a unified framework that subsumes diverse intervention-based interpretability methods including LIME, causal mediation analysis, INLP, and circuit explanations · claim · 0.798
The paper endorses Geiger et al. 2023's claim that disparate interpretability methods are instances of causal abstraction
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.797
Authors connect their finding to the prior probing literature debate