Activation decomposition

The conventional interpretability approach of decomposing a model's activations into interpretable features, for example with sparse autoencoders (SAEs) or transcoders.
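As a rough illustration of what "decomposing activations into features" means, the sketch below runs one common SAE formulation, f = ReLU(W_enc (x - b_dec) + b_enc) with reconstruction x_hat = W_dec f + b_dec, on a toy vector. The function names and all weight values are hypothetical and hand-picked for readability; a real SAE learns its weights under a sparsity penalty and has far more features than dimensions.

```python
# Toy sparse-autoencoder (SAE) style decomposition in plain Python.
# All weights below are made-up illustrative values, not trained parameters.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_encode(x, w_enc, b_enc, b_dec):
    """Feature activations f = ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return relu(pre)

def sae_decode(f, w_dec, b_dec):
    """Reconstruction x_hat = W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]

# Hypothetical 2-dimensional activation and 3 dictionary features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # shape: features x d_model
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]    # shape: d_model x features
B_DEC = [0.0, 0.0]

activation = [2.0, 0.5]
features = sae_encode(activation, W_ENC, B_ENC, B_DEC)
reconstruction = sae_decode(features, W_DEC, B_DEC)
```

Here only two of the three features are active, and the decoder rebuilds the original activation from them. That sparse, additive reading of an activation is what this family of methods aims for.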

Neighborhood — ranked by edge-count

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activations · concept · 0.808
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Decompose parameters, not activations · quote · 0.800
Core slogan encapsulating the paradigm shift of VPD.
Activation Addition · method · 0.783
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
Activation Probing · concept · 0.782
Technique of reading out model beliefs from internal activations before the final answer token is generated.
Activation Similarity · concept · 0.762
Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset.
Activation Compression · concept · 0.760
Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
Activation Correlation · method · 0.757
Pearson correlation of feature activations across 40M tokens used to measure feature similarity and universality across models.
Behavior-Optimized Activation Path Recovery · method · 0.756
Method of optimizing activation-space interventions to produce behavioral paths along M_y, then measuring whether the resulting activation trajectories trace M_h curvature.
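
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. This is a minimal sketch under stated assumptions, not the site's actual pipeline: the function `untyped_neighbors`, the toy two-dimensional embeddings, the "Sparse autoencoder" entity and the edge set are all hypothetical, and only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to `target` and no typed edge to it,
    returned as (name, score) pairs sorted by score, highest first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical toy embeddings (real ones would be high-dimensional).
EMBEDDINGS = {
    "Activation decomposition": [1.0, 0.0],
    "Activations": [0.8, 0.6],          # similar, no typed edge: kept
    "Activation Addition": [0.6, 0.8],  # below the threshold: dropped
    "Sparse autoencoder": [1.0, 0.1],   # similar but already linked: dropped
}
TYPED_EDGES = {frozenset(("Activation decomposition", "Sparse autoencoder"))}

candidates = untyped_neighbors("Activation decomposition", EMBEDDINGS, TYPED_EDGES)
```

Storing edges as unordered pairs (`frozenset`) treats a typed relation in either direction as a reason to exclude the neighbor. A directed graph would need a different check.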