claim

active

claim:causal-abstraction-implicitly-relies-on-strong-assumptions-about-feature-encoding-in-dnns-and-becomes-trivial-without-such-assumptions

Causal abstraction implicitly relies on strong assumptions about feature encoding in DNNs, and becomes trivial without such assumptions

Authors' interpretation connecting their proof to practical interpretability methodology

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Papers (1)

paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

causal abstraction implicitly relies on strong assumptions about how features are encoded in deep neural networks (DNNs), and becomes trivial without such assumptions · quote · 0.944
Load-bearing formulation of the paper's central argument
What can causal abstraction analyses tell us about how DNNs encode features if the methods themselves rely on encoding assumptions? · question · 0.914
Circular dependency problem raised in discussion
What is the connection between information encoding assumptions and causal abstraction? · question · 0.840
Identified as exciting future work direction
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information · claim · 0.840
Central thesis of the paper
An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.832
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
Early causal abstraction methods (Geiger et al. 2021) implicitly rely on the privileged bases hypothesis, while recent methods (Geiger et al. 2024b) rely on the linear representation hypothesis · claim · 0.808
Historical framing of how representation assumptions have evolved in causal interpretability
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.795
Replication of the Wu et al. 2023 finding; DAS expressivity concern validated in the CausalGym setup
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (Marks et al., 2025) · concept · 0.784
Cited as enabling precise behavioral control through SAE features, extending the same methodological line
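
The filter behind the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration only: the page does not say how embeddings are computed or stored, so `cosine`, `related_by_similarity`, the toy vectors, and the edge representation are all hypothetical names and data chosen for the example. Only the 0.65 threshold and the "no typed edge" condition come from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    `embeddings` maps entity id -> vector; `typed_edges` is a list of
    (source_id, target_id) pairs. Both structures are assumptions.
    """
    target_vec = embeddings[target_id]
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): one claim, its source paper, a near-duplicate quote,
# and an unrelated entity.
embeddings = {
    "claim": [1.0, 0.0],
    "paper": [1.0, 0.1],      # similar, but already linked by a typed edge
    "quote": [0.9, 0.1],      # similar and unlinked
    "unrelated": [0.0, 1.0],  # orthogonal to the claim
}
typed_edges = [("paper", "claim")]

neighbors = related_by_similarity("claim", embeddings, typed_edges)
for entity_id, score in neighbors:
    print(f"{entity_id} · {score:.3f}")
```

In this toy setup the paper is excluded because it already has a typed edge to the claim, and the unrelated entity falls below the threshold, so only the quote is returned as a candidate for a new edge or an unrecognized duplicate.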