claim
active
claim:trainable-intervention-das-finds-sparser-gender-representations-than-linear-probing-suggesting-probing-overestimates-causal-coverage
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
Source paper
extracted_from · (2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B · restates · supports · Case Study II result showing DAS identifies fewer causally relevant positions than a probe
Claims (1)
claim
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.807 · Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
- Author interpretation of selectivity results showing DAS advantage diminishes when controlling for expressivity
- Supported by the finding that non-trivial rotations are required to find aligned representations.
- DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1 · finding · 0.769 · DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
- Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
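The "Related by similarity" filter described above (cosine at or above a threshold, and no typed edge to this entity) can be sketched as follows. This is only an illustration under stated assumptions: the page does not say how embeddings are computed or stored, so `embeddings`, `typed_edges`, the entity ids, and the function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but untyped.

    embeddings:  dict of entity id -> vector (hypothetical storage layout)
    typed_edges: set of frozenset({id_a, id_b}) for pairs that already have
                 a typed relation such as "restates" or "supports"
    """
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Entities that already have a typed edge are listed elsewhere on the page.
        if frozenset((target_id, entity_id)) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    # Highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
embeddings = {
    "claim": [1.0, 0.0],
    "finding_a": [0.8, 0.6],
    "finding_b": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
typed_edges = {frozenset(("claim", "finding_b"))}
print(related_by_similarity("claim", embeddings, typed_edges))
```

In the toy data, `finding_b` is the closest vector but is skipped because it already has a typed edge, and `unrelated` falls below the threshold. Passing a stricter `threshold` (for example 0.90) would give a near-duplicate filter of the same shape.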
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.