finding
active
finding:pythia-6-9b-achieves-100-accuracy-on-gendered-pronoun-prediction-task
Pythia-6.9B achieves 100% accuracy on gendered pronoun prediction task
Baseline result confirming the model has fully learned the gender prediction task before probing
Source paper
extracted_from · (2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task · finding · 0.808 · Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
- Baseline accuracy showing small models fail on harder NPI licensing tasks
- Case Study II result showing DAS identifies fewer causally relevant positions than a probe
- Feature steers model toward gender-stereotypical completions.
- Attributed to model anisotropy from saturation making hidden states harder to access
- Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism
- Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.712 · Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
- Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96 · finding · 0.710 · Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity