finding

active

finding:das-trainable-intervention-finds-sparser-gender-representations-across-layers-compared-to-linear-probe-in-pythia-6-9b

DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B

Case Study II result showing DAS identifies fewer causally relevant positions than a probe

Source paper

extracted_from

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

(2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4

Neighborhood — ranked by edge-count

Papers (1)

paper

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
introduces

Claims (2)

claim

Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage
restates · supports
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
A probe may achieve high performance even on representations that are not causally relevant for the task
supports
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance

Findings (1)

finding

Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task
contradicts
Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
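
The gap between probe accuracy and causal relevance described above can be illustrated with a toy sketch. This is not the paper's setup, not the pyvene API, and not DAS itself (DAS learns a rotation, while this sketch swaps fixed units). Everything here is a hypothetical construction for illustration: a two-unit hidden state in which both units linearly encode a binary label, and a toy readout that uses only unit 0. A linear probe scores perfectly on both units, while an interchange intervention changes the output only when unit 0 is swapped.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical binary label per example, alternating -1, +1.
labels = np.tile([-1.0, 1.0], n // 2)

# Toy hidden state: both units encode the label linearly (plus small noise),
# but the toy readout below looks at unit 0 only.
hidden = np.stack([labels, labels], axis=1) + 0.1 * rng.standard_normal((n, 2))

# Source example for each base example; rolling by one guarantees the
# source always has the opposite label.
source = np.roll(np.arange(n), 1)


def toy_model_output(h):
    """Toy readout: the prediction depends on unit 0 only."""
    return np.sign(h[:, 0])


def probe_accuracy(h, unit, y):
    """Fit a one-dimensional least-squares linear probe on a single unit."""
    x = h[:, unit]
    w = (x @ y) / (x @ x)
    return float(np.mean(np.sign(w * x) == y))


def interchange_accuracy(h, unit, y, src):
    """Swap one unit in from the source example and check whether the
    output now matches the source label."""
    patched = h.copy()
    patched[:, unit] = h[src, unit]
    return float(np.mean(toy_model_output(patched) == y[src]))


probe_acc = {u: probe_accuracy(hidden, u, labels) for u in (0, 1)}
iia = {u: interchange_accuracy(hidden, u, labels, source) for u in (0, 1)}

print("probe accuracy per unit:", probe_acc)
print("interchange accuracy per unit:", iia)
```

In this toy, unit 1 is perfectly decodable yet has no effect on the output, so the probe marks it as informative while the intervention does not.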

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.785
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.770
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1 · finding · 0.767
DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
Pythia-6.9B achieves 100% accuracy on gendered pronoun prediction task · finding · 0.759
Baseline result confirming the model has fully learned the gender prediction task before probing
Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96 · finding · 0.757
Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.752
Attributed to model anisotropy from saturation making hidden states harder to access
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.749
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity · finding · 0.749
Shows the passive vs. active divide is more important than the specific wording of instructions.

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage