framework
active
framework:mass-mean-probing
Mass-Mean Probing
Introduced in this paper: an optimization-free probing technique that uses the difference-in-means direction, with an optional covariance correction
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (3)
method
- Linear Discriminant Analysis · extends, implements · IID mass-mean probing coincides with LDA when the covariance is known; used to derive the corrected probe formula
- Logistic Regression Probe · extends · Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
- Mahalanobis Whitening · implements · Alternative interpretation of IID mass-mean probing as projection onto θ_mm after Mahalanobis whitening
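
The relation between the covariance-corrected probe and the whitening interpretation can be checked numerically. The following is a minimal sketch, not the paper's released code: it assumes the correction scores a representation x as θ_mm^T Σ^{-1} x, and that Σ is estimated as the covariance of the class-centered data. The function names (`class_centered_covariance`, `corrected_scores`, `whitened_scores`) and the toy data are hypothetical.

```python
import numpy as np

def class_centered_covariance(acts, labels):
    # Assumption: Sigma is estimated after subtracting each class's own mean.
    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    centered = np.where(labels[:, None], acts - mu_true, acts - mu_false)
    return centered.T @ centered / len(acts)

def corrected_scores(acts, labels, queries):
    # Covariance-corrected score: theta_mm^T Sigma^{-1} x for each query x.
    theta = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    sigma = class_centered_covariance(acts, labels)
    return queries @ np.linalg.solve(sigma, theta)

def whitened_scores(acts, labels, queries):
    # Same score via whitening: map x and theta_mm through Sigma^{-1/2},
    # then take the inner product in the whitened space.
    theta = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    sigma = class_centered_covariance(acts, labels)
    evals, evecs = np.linalg.eigh(sigma)
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric Sigma^{-1/2}
    return (queries @ whiten) @ (whiten @ theta)

# Toy data (hypothetical): correlated Gaussian noise plus a class-dependent shift.
rng = np.random.default_rng(0)
labels = rng.random(200) < 0.5
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
shift = np.array([1.0, 0.0, -1.0])
acts = rng.normal(size=(200, 3)) @ mixing + labels[:, None] * shift

direct = corrected_scores(acts, labels, acts)
via_whitening = whitened_scores(acts, labels, acts)
print(np.allclose(direct, via_whitening))
```

Passing the scores through a sigmoid would turn them into probabilities. The sketch requires Σ to be invertible; with fewer samples than dimensions, a pseudo-inverse or regularization would be needed instead.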
Concepts (1)
concept
- Difference-in-Means Direction · implements · Vector from the mean of false representations to the mean of true representations; core of mass-mean probing
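
As an illustration of the concept above, here is a minimal sketch of computing the difference-in-means direction and projecting representations onto it. The function name `difference_in_means` and the four toy points are hypothetical; real use would substitute model activations and their true/false labels.

```python
import numpy as np

def difference_in_means(acts, labels):
    """Return mean(true representations) - mean(false representations)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

# Toy representations: the two classes differ only along the first axis.
acts = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
labels = np.array([False, False, True, True])

theta = difference_in_means(acts, labels)
scores = acts @ theta  # larger score = further along the "true" direction
```

No optimization is involved: the direction comes from two class means, which is what makes the technique optimization-free.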
Claims (2)
claim
- Motivates the introduction of mass-mean probing as an alternative to LR
- Formal consequence of Belrose et al. (2023) Theorem G.1 connecting mass-mean probing to optimal linear concept erasure
Frameworks (1)
framework
- Linear World Models in LLMs · extends · Prior-work framework studying whether LLMs encode world models as linear structures in their representations
Artifacts (1)
artifact
- Code, datasets, and interactive data explorer released with the paper
Findings (1)
finding
- Formal result establishing the theoretical connection between mass-mean probing and LR
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
- Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
- Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
- Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
- Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
- Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
- Behavioral explanation technique using amnesic counterfactuals by Elazar et al. 2020
- Method of using base models (no post-training) to observe spontaneous self-referential behaviors without confound of memorized introspection language.