concept
active
concept:activation-similarity

Activation Similarity

A model-independent way to compare features: correlate their activation vectors over a fixed, diverse dataset.
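The comparison can be sketched in a few lines of plain Python. This is a minimal illustration. It assumes each feature is represented by its activation values on the same fixed token sequence for both models. `pearson` and `best_matches` are hypothetical helper names. Matching each feature to its single most correlated counterpart is one simple choice, not necessarily the procedure used by any particular source.

```python
import math
from typing import List, Sequence, Tuple

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two activation vectors over the same tokens.

    Returns 0.0 if either vector is constant (e.g. a dead feature),
    since a constant vector has no defined correlation.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must share the same token set (length >= 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def best_matches(
    feats_a: Sequence[Sequence[float]], feats_b: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """For each feature of model A, return (index, correlation) of the most
    correlated feature of model B. Both models must be run on the SAME tokens."""
    matches = []
    for va in feats_a:
        scores = [pearson(va, vb) for vb in feats_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((j, scores[j]))
    return matches
```

Because Pearson correlation ignores positive rescaling and offsets, a feature in model B that fires on the same tokens as a feature in model A scores close to 1 even if its activations are on a different scale; no shared weights or architecture are needed, which is what makes the comparison model-independent.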

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
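The neighborhood rule (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This assumes each entity has a precomputed embedding vector and that typed edges are stored as unordered pairs of entity names. `similarity_candidates` and this data layout are hypothetical, since the page does not say how embeddings or edges are stored.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_candidates(
    entity: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[FrozenSet[str]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the cosine threshold that share no typed edge with `entity`,
    sorted from most to least similar (hypothetical storage layout)."""
    base = embeddings[entity]
    out = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed relation.
        if name == entity or frozenset((entity, name)) in typed_edges:
            continue
        score = cosine(base, vec)
        if score >= threshold:
            out.append((name, score))
    return sorted(out, key=lambda pair: pair[1], reverse=True)
```

Excluding typed neighbors is what turns the output into a review queue: anything that survives is semantically close but unlinked, so it is either a missing edge or a possible duplicate.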

  • Activations · concept · 0.809
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Pearson correlation of feature activations across 40M tokens, used to measure feature similarity and universality across models
  • Latent model activations when processing inputs framed from another agent's perspective
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
  • Similarity measured with respect to network behavior/function rather than statistical correlation of activations.
  • Correlating attribution vectors (feature activation × logit weight of the next token) across model pairs to measure functional universality
  • Adding a steering vector in the forward direction to push model activations toward stronger reflective behavior.
  • The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.