Mass-Mean Probing

Introduced in this paper: an optimization-free probing technique that uses the difference-in-means direction, with an optional covariance correction

Neighborhood — ranked by edge-count

paper

method

Linear Discriminant Analysis
extends · implements
IID mass-mean probing coincides with LDA when covariance is known; used to derive the corrected probe formula
Logistic Regression Probe
extends
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
Mahalanobis Whitening
implements
Alternative interpretation of IID mass-mean probing as projection onto θ_mm after Mahalanobis whitening
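
The relationships listed above can be written out as a short worked formula. This is a sketch under assumptions: the symbols $\mu^{+}$, $\mu^{-}$ (class means of true and false representations), $\Sigma$ (covariance of the class-centered data, assumed symmetric positive definite), and $\sigma$ (the logistic function) are notation chosen here for illustration, and the last equality is just the standard factorization $\Sigma^{-1} = \Sigma^{-1/2}\Sigma^{-1/2}$, one way to read the whitening interpretation.

```latex
\theta_{\mathrm{mm}} = \mu^{+} - \mu^{-}, \qquad
p_{\mathrm{mm}}(x) = \sigma\!\left(\theta_{\mathrm{mm}}^{\top} x\right), \qquad
p_{\mathrm{mm}}^{\mathrm{iid}}(x)
  = \sigma\!\left(\theta_{\mathrm{mm}}^{\top} \Sigma^{-1} x\right)
  = \sigma\!\left(\big(\Sigma^{-1/2}\theta_{\mathrm{mm}}\big)^{\top} \big(\Sigma^{-1/2} x\big)\right)
```

The direction $\Sigma^{-1}\theta_{\mathrm{mm}}$ has the same form as the LDA discriminant direction for two classes with a shared covariance, which is consistent with the LDA entry above.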

concept

Difference-in-Means Direction
implements
Vector from mean of false representations to mean of true representations; core of mass-mean probing
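
A minimal NumPy sketch of the difference-in-means direction and the optional covariance correction described above. The function names `fit_mass_mean_probe` and `probe_score` are hypothetical and do not come from the released repository. The sketch also assumes a pooled within-class covariance estimate, a pseudo-inverse for numerical safety, no bias term, and activations that are roughly centered.

```python
import numpy as np


def fit_mass_mean_probe(acts, labels, iid=True):
    """Fit a mass-mean probe without any optimization.

    acts:   (n, d) array of representations.
    labels: (n,) array of 0/1 (false/true) labels.
    Returns (theta, cov_inv); cov_inv is None when iid=False.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)

    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    # Difference-in-means direction: from the false mean to the true mean.
    theta = mu_true - mu_false
    if not iid:
        return theta, None

    # Assumption: covariance of the data after centering each class
    # by its own mean (pooled within-class covariance).
    centered = np.where(labels[:, None], acts - mu_true, acts - mu_false)
    cov = centered.T @ centered / len(acts)
    # Pseudo-inverse guards against a singular covariance estimate.
    return theta, np.linalg.pinv(cov)


def probe_score(x, theta, cov_inv=None):
    """Probability-like score sigmoid(theta^T Sigma^{-1} x), or
    sigmoid(theta^T x) when no covariance correction is supplied."""
    x = np.asarray(x, dtype=float)
    direction = theta if cov_inv is None else cov_inv @ theta
    return 1.0 / (1.0 + np.exp(-(x @ direction)))
```

Passing `iid=False` returns only the raw direction, which is the uncorrected variant; passing the returned `cov_inv` to `probe_score` applies the covariance correction.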

claim

framework

Linear World Models in LLMs
extends
Prior work framework studying whether LLMs encode world models as linear structures in their representations

artifact

geometry-of-truth GitHub repository
about
Code, datasets, and interactive data explorer released with the paper

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Diagnostic Probing · method · 0.760
Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
Probing Methods · method · 0.759
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
Unsupervised Probing · method · 0.755
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
Probes · concept · 0.755
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Why did mass-mean probing with cities+neg_cities perform poorly for the 70B model, despite mass-mean probing with larger_than+smaller_than performing well? · question · 0.736
Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
Sparse Probing · method · 0.724
Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
Amnesic Probing · method · 0.715
Behavioral explanation technique using amnesic counterfactuals by Elazar et al. 2020
base model probing · method · 0.710
Method of using base models (no post-training) to observe spontaneous self-referential behaviors without the confound of memorized introspection language.