finding

active

finding:das-learning-rate-of-5e-3-outperforms-1e-3-used-in-wu-et-al-2023-for-small-training-sets-in-causalgym

DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym

Hyperparameter-tuning result for DAS. The best learning rate differs from prior work (Wu et al. 2023 used 1e-3) because CausalGym's training sets are smaller.

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.790
Numerical result for pythia-410m
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.787
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.763
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · claim · 0.762
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.759
Scaling result showing larger pythia models perform better on CausalGym linguistic tasks
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents · finding · 0.751
Table 2, row 3, showing equivalence when prior preferences match rewards.
DAS behavioral loss achieves IIA of 0.997±0.001 on synthetic 10-class dataset training/test sets · finding · 0.750
IIA baseline for DAS behavioral loss on synthetic dataset
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.749
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup