finding

active

finding:lda-barely-outperforms-random-features-across-all-pythia-model-sizes-in-causalgym

LDA barely outperforms random features across all pythia model sizes in CausalGym

A surprising negative result, given that LDA is a supervised method

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.894
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
CausalGym results may differ on models trained on different data or in different orders beyond the pythia series · question · 0.813
Identified limitation about generalizability across model training regimes
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.792
Attributed to model anisotropy from saturation making hidden states harder to access
DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.783
Numerical result for pythia-410m
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.755
Scaling result showing larger pythia models perform better on CausalGym linguistic tasks
The generalization improvement from explicit instructions observed in Llama models (A1-A3 to F0-F2) is more pronounced for F3-F5 to F0-F2 in Gemma models. · claim · 0.735
Shows the instruction effect, while shifting geometry, may not produce consistent generalization effects across model families.
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.733
Key limitation acknowledged by authors.
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B · finding · 0.731
Case Study II result showing DAS identifies fewer causally relevant positions than a probe
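
The "related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the system's actual implementation: the function names, the entity IDs, the two-dimensional embeddings, and the representation of typed edges as a set of ID pairs are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to `target`, sorted with the highest score first."""
    hits = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((other, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: IDs and vectors are invented for this example.
embeddings = {
    "lda-finding": [1.0, 0.0],
    "das-finding": [0.9, 0.1],
    "source-paper": [1.0, 0.05],
    "unrelated": [0.0, 1.0],
}
# The source paper is linked by a typed edge (extracted_from), so it is excluded.
typed_edges = {("lda-finding", "source-paper")}

candidates = related_by_similarity("lda-finding", embeddings, typed_edges)
```

In this toy setup only `das-finding` survives both filters: `source-paper` is similar but already has a typed edge, and `unrelated` falls below the threshold.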