claim

active

claim:trainable-intervention-das-finds-sparser-gender-representations-than-linear-probing-suggesting-probing-overestimates-causal-coverage

Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage

Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions

Source paper

extracted_from

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

(2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4

Neighborhood — ranked by edge-count

Papers (1)

paper

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
introduces

Findings (1)

finding

DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B
restates · supports
Case Study II result showing DAS identifies fewer causally relevant positions than a probe

Claims (1)

claim

A probe may achieve high performance even on representations that are not causally relevant for the task
extends
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
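
The distinction drawn in this claim can be illustrated with a toy sketch. It is not the paper's experiment, does not use the pyvene API, and does not implement DAS: it replaces DAS's learned rotation with a plain interchange intervention on a single coordinate. All names (`h`, `w_out`, `probe_accuracy`, `interchange_effect`) and the synthetic data are assumptions made for illustration. Two hidden dimensions both encode a binary label, but the toy output head reads only the first, so a probe succeeds on both while an intervention affects the output through only one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, size=n)   # hypothetical binary attribute
signal = 2.0 * labels - 1.0           # encode the label as -1 / +1

# Both hidden dimensions carry the label (plus small noise).
h = np.stack(
    [signal + 0.1 * rng.standard_normal(n),
     signal + 0.1 * rng.standard_normal(n)],
    axis=1,
)

# Toy output head (an assumption): it reads dimension 0 and ignores dimension 1.
w_out = np.array([1.0, 0.0])


def model_output(hidden):
    """Binary prediction of the toy model from its hidden state."""
    return (hidden @ w_out > 0).astype(int)


def probe_accuracy(dim):
    """Accuracy of a fixed threshold probe (a trivial linear classifier) on one dimension."""
    return float(np.mean((h[:, dim] > 0).astype(int) == labels))


def interchange_effect(dim):
    """Fraction of outputs that flip when `dim` is overwritten with the
    value taken from a source example carrying the opposite label."""
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    # For each base example, pick one source example with the other label.
    sources = np.where(labels == 1, idx0[0], idx1[0])
    patched = h.copy()
    patched[:, dim] = h[sources, dim]
    return float(np.mean(model_output(patched) != model_output(h)))


for d in (0, 1):
    print(f"dim {d}: probe acc = {probe_accuracy(d):.2f}, "
          f"flip rate under intervention = {interchange_effect(d):.2f}")
```

In this setup the probe decodes the label from both dimensions, yet only the intervention on dimension 0 changes the model's output. Probe accuracy alone would therefore count dimension 1 as a representation of the attribute even though the output never depends on it.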

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.807
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
Probe-based data attribution effectively reduces harmful behaviors via data interventions · claim · 0.786
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods · claim · 0.776
Author interpretation of selectivity results showing DAS advantage diminishes when controlling for expressivity
Direct probes over learned activations in standard basis may fail to reveal the actual causal role of representations because they are highly distributed · claim · 0.771
Supported by the finding that non-trivial rotations are required to find aligned representations.
DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1 · finding · 0.769
DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment. · claim · 0.769
Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.769
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.769
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B