Pythia-6.9B achieves 100% accuracy on a gendered pronoun prediction task

Baseline result confirming the model has fully learned the gender prediction task before probing

Source paper

extracted_from

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

(2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4

Neighborhood — ranked by edge-count

Papers (1)

paper

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task · finding · 0.808
Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
pythia-14m achieves only 0.38 accuracy on npi_ever_subj-relc task · finding · 0.776
Baseline accuracy showing small models fail on harder NPI licensing tasks
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B · finding · 0.759
Case Study II result showing DAS identifies fewer causally relevant positions than a probe
Clamping gender bias in professions feature 34M/24442848 to high activation causes model to emphasize female pronouns and discuss nursing as female-dominated · finding · 0.739
Feature steers model toward gender-stereotypical completions
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.727
Attributed to model anisotropy from saturation making hidden states harder to access
NPI mechanism in pythia-1b moves negation feature through complementiser 'that', auxiliary verb, and main verb across layers before predicting NPI 'any' · finding · 0.725
Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.712
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96 · finding · 0.710
Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity