finding
active
finding:das-trainable-intervention-finds-sparser-gender-representations-across-layers-compared-to-linear-probe-in-pythia-6-9b
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B
Case Study II result showing DAS identifies fewer causally relevant positions than a probe
Source paper
extracted_from · (2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (2)
claim
- Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · restates · supports · Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Findings (1)
finding
- Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.785 · Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
- DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.770 · Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
- DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1 · finding · 0.767 · DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
- Baseline result confirming the model has fully learned the gender prediction task before probing
- Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96 · finding · 0.757 · Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
- Attributed to model anisotropy from saturation making hidden states harder to access
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- Shows the passive vs. active divide is more important than the specific wording of instructions.
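
The "Related by similarity" listing above can be read as a threshold filter over entity embeddings: keep entities whose cosine similarity to this finding is at least 0.65 and that have no typed edge to it. The following is a minimal sketch under stated assumptions. The page does not say how embeddings are produced or stored, so the function names (`cosine`, `untyped_neighbors`), the toy vectors, and the entity ids are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(query_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, best first.

    Entities already linked to the query by a typed edge are skipped,
    mirroring the "no typed edge" condition on the page.
    """
    query_vec = embeddings[query_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == query_id or entity_id in typed_edges:
            continue
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: 2-d vectors stand in for real entity embeddings.
embeddings = {
    "q": [1.0, 0.0],   # the finding being viewed
    "a": [0.9, 0.1],   # close neighbor
    "b": [0.0, 1.0],   # orthogonal, falls below the threshold
    "c": [1.0, 1.0],   # moderately similar
    "d": [1.0, 0.05],  # very similar, but already has a typed edge
}
typed_edges = {"d"}

hits = untyped_neighbors("q", embeddings, typed_edges)
neighbor_ids = [entity_id for entity_id, _ in hits]
```

With this toy data, `neighbor_ids` contains `a` and `c`: `d` is skipped because it already has a typed edge, and `b` falls below the threshold. Keeping the threshold as a parameter lets the same function serve a stricter cutoff for near-duplicate detection.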
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.