Statistical Activation Analysis

A component of the contrastive retrieval pipeline that analyzes activation statistics.

Neighborhood — ranked by edge-count

method

Contrastive Feature Retrieval Pipeline
cites
A pipeline employing controlled semantic oppositions to distill monosemantic functional features from sparse activation spaces.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one: candidates for new edges or unrecognized duplicates.

Activations · concept · 0.768
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Activation Correlation · method · 0.766
Pearson correlation of feature activations across 40M tokens, used to measure feature similarity and universality across models.
Sequential SAE Activation Analysis · method · 0.757
Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes.
Activation Addition · method · 0.750
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
Activation Interval Sampling · method · 0.749
Dividing the feature activation spectrum into 11 evenly spaced intervals and sampling uniformly to evaluate monosemanticity across activation levels.
Activation Probing · concept · 0.740
Technique of reading out model beliefs from internal activations before the final answer token is generated.
Statistical Physics (Statistical Mechanics) · framework · 0.727
The branch of physics dealing with large numbers of particles, statistical laws, and the tendency to disorder, as described by Boltzmann and Gibbs.
Activation Similarity · concept · 0.725
Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset.
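
The "cosine ≥ 0.65 · no typed edge" filter that produces the list above can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed relations are stored as name pairs. The function names, the toy embeddings, and the edge list are hypothetical and chosen only for illustration; how the page actually computes its scores is not stated in the source.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge
    to `target`, sorted by score, highest first."""
    # Entities already linked to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: 3-dimensional embeddings, not real scores.
embeddings = {
    "Statistical Activation Analysis": [1.0, 0.0, 0.0],
    "Contrastive Feature Retrieval Pipeline": [0.9, 0.1, 0.0],
    "Activations": [0.8, 0.6, 0.0],
    "Unrelated Entity": [0.0, 0.0, 1.0],
}
typed_edges = [
    ("Statistical Activation Analysis", "Contrastive Feature Retrieval Pipeline"),
]

candidates = untyped_neighbors(
    "Statistical Activation Analysis", embeddings, typed_edges
)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy setup the pipeline entity is excluded because it already has a typed edge, and the unrelated entity falls below the threshold, so only the untyped, high-similarity neighbor remains as a candidate for a new edge or a duplicate check.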