Activation Similarity

Model-independent feature comparison that correlates feature activation vectors computed over a fixed, diverse dataset.
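A minimal sketch of this comparison, assuming each model's feature activations have already been collected into an array of shape (tokens, features) over the same dataset. The function name `activation_similarity` and the toy data are hypothetical, and Pearson correlation is used here as one common choice of correlation; the entry itself does not specify an implementation.

```python
import numpy as np

def activation_similarity(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every feature in A and every feature in B.

    acts_a: (n_tokens, n_features_a), acts_b: (n_tokens, n_features_b).
    Row i of both arrays must come from the same token of the shared dataset.
    """
    if acts_a.shape[0] != acts_b.shape[0]:
        raise ValueError("both models must be run on the same tokens")
    # Center each feature's activation vector over the dataset.
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)
    # Scale to unit length; eps avoids division by zero for dead features.
    eps = 1e-12
    a /= np.linalg.norm(a, axis=0, keepdims=True) + eps
    b /= np.linalg.norm(b, axis=0, keepdims=True) + eps
    # Entry (i, j) is the correlation of feature i in A with feature j in B.
    return a.T @ b

# Toy data (assumption): feature 0 of model B is an affine copy of
# feature 1 of model A; feature 1 of model B is unrelated noise.
rng = np.random.default_rng(0)
acts_a = rng.random((1000, 3))
acts_b = np.stack([2.0 * acts_a[:, 1] + 1.0, rng.random(1000)], axis=1)

sim = activation_similarity(acts_a, acts_b)
best_match = sim.argmax(axis=0)  # best-matching A feature for each B feature
```

Because the comparison only needs activations on shared inputs, it does not depend on the two models having the same architecture or feature basis. `np.corrcoef` with `rowvar=False` yields the same values but computes the full stacked correlation matrix.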

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.

Activations · concept · 0.809
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Activation Correlation · method · 0.807
Pearson correlation of feature activations across 40M tokens, used to measure feature similarity and universality across models.
Other-Referencing Activations · concept · 0.798
Latent model activations when processing inputs framed from another agent's perspective.
Activation Addition · method · 0.789
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
Functional Similarity · concept · 0.787
Similarity measured with respect to network behavior/function rather than statistical correlation of activations.
Attribution Similarity · method · 0.787
Correlating attribution vectors (feature activation × logit weight of the next token) across model pairs to measure functional universality.
Reflection Enhancement via Activation Addition · method · 0.765
Adding a steering vector in the forward direction to push model activations toward stronger reflective behavior.
Activation decomposition · concept · 0.762
The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.
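
The selection rule stated at the top of this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is an illustration only: the page does not say what is embedded or how the scores are computed, so the embedding vectors, the entity names, and the function `similar_untyped` below are all hypothetical.

```python
import numpy as np

def similar_untyped(query, candidates, typed_neighbors, threshold=0.65):
    """Return (name, cosine) pairs at or above threshold with no typed edge.

    query: embedding of the current entity (assumed representation).
    candidates: dict mapping entity name -> embedding vector.
    typed_neighbors: names already linked to the entity by a typed edge.
    """
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    results = []
    for name, vec in candidates.items():
        if name in typed_neighbors:
            continue  # already related by a typed edge, so not listed here
        v = np.asarray(vec, dtype=float)
        cos = float(q @ v / np.linalg.norm(v))
        if cos >= threshold:
            results.append((name, cos))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda item: item[1], reverse=True)

# Toy embeddings (assumption): "near" has cosine 0.8 with the query,
# "far" is orthogonal, and "linked" is identical but already has a typed edge.
candidates = {"near": [0.8, 0.6], "far": [0.0, 1.0], "linked": [1.0, 0.0]}
related = similar_untyped([1.0, 0.0], candidates, typed_neighbors={"linked"})
```

Entries that pass the filter are the ones worth reviewing by hand, either to add a typed edge or to merge a duplicate.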