finding

active

finding:smaller-fully-trained-pythia-models-31m-70m-show-slightly-reduced-alignment-accuracy-compared-to-larger-models-despite-non-linear-maps

Smaller fully trained Pythia models (31M, 70M) show slightly reduced alignment accuracy compared to larger models despite non-linear maps

Attributed to anisotropy arising from saturation in the smaller models, which makes their hidden states harder to access via alignment maps
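As a rough illustration of what "anisotropy" means here, the sketch below scores a set of hidden-state vectors by their mean pairwise cosine similarity. This is one common proxy and an assumption on our part: this record does not say which anisotropy measure the source paper uses. The function names and the toy vectors are hypothetical and are not taken from any Pythia model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anisotropy(hidden_states):
    """Mean cosine similarity over all distinct pairs of vectors.

    Values near 1 mean the vectors crowd into a narrow cone (highly
    anisotropic); values near 0 or below mean they are spread out.
    """
    pairs = list(combinations(hidden_states, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy, made-up vectors for illustration only.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
narrow = [[1.0, 0.1], [1.0, -0.1], [1.0, 0.2], [1.0, 0.0]]

print("spread:", anisotropy(spread))
print("narrow:", anisotropy(narrow))
```

On real models the vectors would be hidden states collected at a given layer. The narrow-cone case corresponds to the situation this finding describes, where most directions carry little distinguishing signal for an alignment map to exploit.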

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Concepts (1)

concept

Anisotropy in Language Models
supports
Property of smaller saturated models making hidden states harder to access via alignment maps

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

pythia-14m achieves only 0.38 accuracy on npi_ever_subj-relc task · finding · 0.805
Baseline accuracy showing small models fail on harder NPI licensing tasks

LDA barely outperforms random features across all pythia model sizes in CausalGym · finding · 0.792
Surprising negative result for LDA despite being a supervised method

Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.791
Suggests linear maps may be better correlated with genuine task implementation than non-linear maps

DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.789
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random

Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task · finding · 0.779
Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations

CausalGym results may differ on models trained on different data or in different orders beyond the pythia series · question · 0.774
Identified limitation about generalizability across model training regimes

Across 5 Pythia seeds, one seed fails to learn IOI task and another fails alignment despite learning the task; all other seeds achieve perfect alignment with ϕ_nonlin · finding · 0.761
Robustness check across seeds showing occasional failures of alignment map training

DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B · finding · 0.752
Case Study II result showing DAS identifies fewer causally relevant positions than a probe