question

active

How much of the causal effect found by DAS is due to its expressivity rather than any aspect of the representation being studied?

Core methodological question motivating the introduction of selectivity and control tasks

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
introduces

Findings (1)

finding

Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96
answered_by
Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
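
To make the comparison in this finding concrete, here is a minimal sketch of how a selectivity score could be computed. It assumes that selectivity is the causal effect measured on the real task minus the effect measured on a control task with arbitrary labels. The names `MethodEffects`, `selectivity`, `task_effect`, and `control_effect` are hypothetical and do not come from the source paper. The numbers in the usage lines are placeholders chosen for illustration, not reported results.

```python
from dataclasses import dataclass


@dataclass
class MethodEffects:
    """Causal effect of one interpretability method (hypothetical container)."""
    task_effect: float     # effect on the real linguistic task
    control_effect: float  # effect on a control task with arbitrary labels


def selectivity(effects: MethodEffects) -> float:
    """Assumed definition: real-task effect minus control-task effect.

    A highly expressive method can score well on the control task too,
    which lowers its selectivity even if its raw task effect is large.
    """
    return effects.task_effect - effects.control_effect


# Placeholder values for illustration only (not taken from the paper).
expressive_method = MethodEffects(task_effect=5.0, control_effect=2.0)
simple_method = MethodEffects(task_effect=3.5, control_effect=0.25)

print(selectivity(expressive_method))  # 3.0
print(selectivity(simple_method))      # 3.25
```

Under this assumed definition, a method with a larger raw effect can still come out less selective, which is the pattern the finding describes for probes versus DAS.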

Claims (1)

claim

DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods
gates
Author interpretation of selectivity results showing DAS advantage diminishes when controlling for expressivity

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.860
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.825
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information · claim · 0.779
Central thesis of the paper
DAS overcomes the localist limitation of prior causal abstraction by allowing individual neurons to play multiple roles via non-standard bases · claim · 0.772
Central claim motivating DAS over prior methods.
Feature attribution correlates well with ablation effects, making it an efficient proxy for causal effect. · claim · 0.757
Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
Causal abstraction implicitly relies on strong assumptions about feature encoding in DNNs, and becomes trivial without such assumptions · claim · 0.753
Authors' interpretation connecting their proof to practical interpretability methodology
An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.752
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
Direct probes over learned activations in standard basis may fail to reveal the actual causal role of representations because they are highly distributed · claim · 0.751
Supported by the finding that non-trivial rotations are required to find aligned representations.