question

active

question:causalgym-results-may-differ-on-models-trained-on-different-data-or-in-different-orders-beyond-the-pythia-series

CausalGym results may differ on models beyond the Pythia series that were trained on different data or in a different order

Identified limitation about generalizability across model training regimes

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS consistently finds the most causally-efficacious features across all Pythia model sizes in CausalGym · finding · 0.837
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
CausalGym only includes English data; comparable experiments with other languages might yield substantially different results · question · 0.814
Identified limitation/gap calling for cross-lingual extension of CausalGym
LDA barely outperforms random features across all Pythia model sizes in CausalGym · finding · 0.813
Surprising negative result for LDA despite being a supervised method
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.774
Attributed to model anisotropy from saturation making hidden states harder to access
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.759
Scaling result showing larger Pythia models perform better on CausalGym linguistic tasks
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.745
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
CausalGym covers only linguistic tasks; benchmarking interpretability methods on non-linguistic behaviours remains open · question · 0.742
Identified limitation calling for broader task coverage in future work
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B · finding · 0.739
Case Study II result showing DAS identifies fewer causally relevant positions than a probe