finding

active

finding:linear-probe-achieves-100-classification-accuracy-for-almost-all-components-in-pythia-6-9b-gender-task

Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task

Demonstrates that linear probes can overestimate causal relevance: a probe can reach high accuracy on representations that are not causally relevant to the task
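The gap between probe accuracy and causal relevance can be illustrated with a toy sketch. This is not the pyvene API and not the Pythia-6.9B experiment; the two-component "model" and the names `hidden`, `output`, `probe_accuracy`, and `interchange_accuracy` are hypothetical, introduced only for illustration. Both components encode the label, but the output reads only component 0, so a least-squares linear probe decodes the label from either component while an interchange intervention changes the output only when component 0 is patched.

```python
import numpy as np

# Toy "model" (hypothetical): both hidden components encode the binary label,
# but only component 0 feeds the output. Component 1 is a causal bystander.
def hidden(x):
    s = 2.0 * x - 1.0                 # map labels {0, 1} to {-1, +1}
    return np.stack([s, s], axis=1)   # shape (n, 2)

def output(h):
    return (h[:, 0] > 0).astype(int)  # reads component 0 only

def probe_accuracy(feature, labels):
    # Least-squares linear probe with a bias term, thresholded at 0.5.
    X = np.stack([feature, np.ones_like(feature)], axis=1)
    w, *_ = np.linalg.lstsq(X, labels.astype(float), rcond=None)
    return float(((X @ w > 0.5).astype(int) == labels).mean())

def interchange_accuracy(component, base, source):
    # Patch one component of the base run with its value from the source run,
    # then check whether the output follows the source label.
    h = hidden(base).copy()
    h[:, component] = hidden(source)[:, component]
    return float((output(h) == source).mean())

labels = np.arange(200) % 2   # balanced binary labels
source = 1 - labels           # counterfactual inputs with the opposite label
h = hidden(labels)

probe_acc = [probe_accuracy(h[:, c], labels) for c in range(2)]
iia = [interchange_accuracy(c, labels, source) for c in range(2)]
print("probe accuracy per component:", probe_acc)
print("interchange accuracy per component:", iia)
```

In this sketch the probe scores 1.0 on both components, while the interchange accuracy is 1.0 for component 0 and 0.0 for component 1. Only the intervention separates the component the output depends on from the one that merely correlates with the label.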

Source paper

extracted_from

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

(2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4

Neighborhood — ranked by edge-count

Papers (1)

paper

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
introduces

Claims (1)

claim

A probe may achieve high performance even on representations that are not causally relevant for the task
supports
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance

Findings (1)

finding

DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B
contradicts
Case Study II result showing DAS identifies fewer causally relevant positions than a probe

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96 · finding · 0.808
Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
Pythia-6.9B achieves 100% accuracy on gendered pronoun prediction task · finding · 0.808
Baseline result confirming the model has fully learned the gender prediction task before probing
pythia-14m achieves only 0.38 accuracy on npi_ever_subj-relc task · finding · 0.793
Baseline accuracy showing small models fail on harder NPI licensing tasks
Linear Probe for Evaluation Awareness · method · 0.785
Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps · finding · 0.779
Attributed to model anisotropy from saturation making hidden states harder to access
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.778
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.772
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.760
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps