concept
active
concept:activation-decomposition

Activation decomposition

The conventional approach of decomposing a model's activations into interpretable features, for example with sparse autoencoders (SAEs) or transcoders.
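As a rough illustration of what such a decomposition looks like in the SAE case, the sketch below writes one activation vector as a sum of non-negative feature activations times decoder directions. It is a minimal sketch only: the function name `sae_decompose`, the weight names, the dimensions, and the random untrained weights are all assumptions made for illustration, not something specified by this entry, and a real SAE would be trained with a reconstruction loss plus a sparsity penalty.

```python
import numpy as np

def sae_decompose(x, W_enc, b_enc, W_dec, b_dec):
    """Decompose one activation vector x into per-feature contributions.

    Hypothetical helper: follows the common ReLU sparse-autoencoder form,
    f = ReLU(W_enc (x - b_dec) + b_enc), x_hat = f @ W_dec + b_dec.
    """
    # Feature activations, shape (n_features,); ReLU keeps them non-negative.
    f = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)
    # Each feature contributes its activation times its decoder direction.
    contributions = f[:, None] * W_dec  # shape (n_features, d_model)
    # The reconstruction is the sum of contributions plus the decoder bias.
    x_hat = contributions.sum(axis=0) + b_dec
    return f, contributions, x_hat

# Illustrative sizes and random (untrained) weights: assumptions only.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm directions
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)  # stand-in for a model activation vector
f, contributions, x_hat = sae_decompose(x, W_enc, b_enc, W_dec, b_dec)
```

With trained weights, most entries of `f` would be zero for any given input, and the few active rows of `contributions` are what gets inspected as "features". With the random weights used here the output only demonstrates the shapes and the additive structure.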

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Activations · concept · 0.808
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Core slogan encapsulating the paradigm shift of VPD.
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
  • Activation Probing · concept · 0.782
    Technique of reading out model beliefs from internal activations before the final answer token is generated.
  • Model-independent feature comparison based on correlating activation vectors across a fixed, diverse dataset.
  • Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
  • Pearson correlation of feature activations across 40M tokens, used to measure feature similarity and universality across models.
  • Method of optimizing activation-space interventions to produce behavioral paths along M_y, then measuring whether the resulting activation trajectories trace M_h curvature.