finding

active

finding:das-finds-causal-effect-at-all-training-timesteps-including-when-model-is-just-initialised

DAS finds causal effect at all training timesteps including when model is just initialised

Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Findings (1)

finding

DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist
supports
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How much of the causal effect found by DAS is due to its expressivity rather than any aspect of the representation being studied? · question · 0.825
Core methodological question motivating the introduction of selectivity and control tasks
DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods · claim · 0.814
Author interpretation of selectivity results showing DAS advantage diminishes when controlling for expressivity
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · claim · 0.807
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym · finding · 0.787
Hyperparameter tuning result for DAS; different from prior work due to smaller training set size
DAS consistently finds the most causally-efficacious features across all Pythia model sizes in CausalGym · finding · 0.784
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
Causal emergence predictive of final reward early in RL training across multiple algorithms, architectures, and environments. · finding · 0.777
Empirical result: CE measurements correlate with and predict learning performance in RL agents.
Representational dynamics of causal emergence align with reward improvement in most tasks. · finding · 0.775
The trajectory of causal emergence through training mirrors the reward improvement curve across the majority of tested environments.
Causal emergence measured by NIS+ increases with observational noise but decreases with dynamical noise. · finding · 0.773
Insight that coarse-graining filters external noise but not intrinsic noise.