question
active
question:causalgym-only-includes-english-data-comparable-experiments-with-other-languages-might-yield-substantially-different-results
CausalGym only includes English data; comparable experiments with other languages might yield substantially different results
Identified limitation/gap calling for cross-lingual extension of CausalGym
Source paper
extracted_from · (2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Identified limitation calling for broader task coverage in future work
- Limitation question about generalizability of CausalGym findings beyond English
- CausalGym results may differ on models trained on different data or in different orders beyond the pythia series · question · 0.814 · Identified limitation about generalizability across model training regimes
- Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym
- Multi-dimensional linear and non-linear interpretability methods have not been benchmarked on CausalGym · question · 0.777 · Identified gap in benchmark coverage; only 1D linear methods are benchmarked
- DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.764 · Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
- Universalist claim predicting cross-cultural generality.
- DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym · finding · 0.744 · Hyperparameter tuning result for DAS; differs from prior work because of the smaller training set size