finding

active

finding:das-achieves-substantial-causal-effect-even-on-arbitrary-input-output-mappings-where-no-causal-mechanism-should-exist

DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist

Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
associated_with

Claims (1)

claim

DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods
supports
Authors' interpretation of the selectivity results, which show that DAS's advantage diminishes when controlling for expressivity

Findings (1)

finding

DAS finds causal effect at all training timesteps, including when the model is just initialised
supports
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How much of the causal effect found by DAS is due to its expressivity rather than any aspect of the representation being studied? · question · 0.860
Core methodological question motivating the introduction of selectivity and control tasks
DAS overcomes the localist limitation of prior causal abstraction by allowing individual neurons to play multiple roles via non-standard bases · claim · 0.820
Central claim motivating DAS over prior methods.
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information · claim · 0.804
Central thesis of the paper
An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.796
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
Causal abstraction implicitly relies on strong assumptions about feature encoding in DNNs, and becomes trivial without such assumptions · claim · 0.795
Authors' interpretation connecting their proof to practical interpretability methodology
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.782
Authors connect their finding to the prior probing literature debate
Causal emergence measured by NIS+ increases with observational noise but decreases with dynamical noise · finding · 0.777
Insight that coarse-graining filters external noise but not intrinsic noise.
Early causal abstraction methods (Geiger et al. 2021) implicitly rely on the privileged bases hypothesis, while recent methods (Geiger et al. 2024b) rely on the linear representation hypothesis · claim · 0.777
Historical framing of how representation assumptions have evolved in causal interpretability