concept
active
concept:activations

Activations

Internal representations of the model on which probes operate; the linear-classifier method listed below uses activations to rank training datapoints.

Neighborhood — ranked by edge-count

Methods (1)

method
  • Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
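
A minimal sketch of how a linear probe over activations could rank training datapoints. The source does not say how the probe is trained or how scores are computed, so everything here is an assumption made for illustration: the logistic-regression probe, the toy 3-dimensional activations, and the hypothetical helpers `train_linear_probe` and `rank_datapoints`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_probe(acts, labels, lr=0.5, steps=200):
    """Fit a logistic-regression probe with batch gradient descent.

    acts:   list of activation vectors (hypothetical toy data here)
    labels: 1.0 if the undesired behavior is present, else 0.0
    """
    n, d = len(acts), len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(acts, labels):
            p = 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def rank_datapoints(train_acts, w, b):
    """Score each training datapoint by its probe logit, highest first."""
    scores = [dot(w, x) + b for x in train_acts]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

# Assumed toy setup: the behavior is encoded along dimension 0.
probe_acts = [[2, 0, 1], [3, 1, 0], [2, 1, 1],
              [-2, 0, 1], [-3, 1, 0], [-2, 1, 1]]
probe_labels = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
w, b = train_linear_probe(probe_acts, probe_labels)

# Activations recorded on four hypothetical training datapoints.
train_acts = [[0.1, 1, 0], [2.5, 0, 1], [-1, 0.5, 0.5], [1.5, 1, 1]]
order, scores = rank_datapoints(train_acts, w, b)
print(order)  # → [1, 3, 0, 2]
```

In this toy setup the datapoints whose activations lie furthest along the probe direction are ranked first. A real pipeline would need actual model activations and a principled choice of layer and labels, none of which the source specifies.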

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
  • Activation Oracles · framework · 0.843
    Framework training LLMs to answer questions about externally-provided activation vectors
  • Activation space · concept · 0.826
    Representation space on which linear probes operate to attribute harmful behaviors to training data.
  • Supervised method training models to answer questions about activations; NLAs differ by being unsupervised.
  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
  • Latent model activations when processing inputs framed from another agent's perspective
  • Activation Probing · concept · 0.811
    Technique of reading out model beliefs from internal activations before the final answer token is generated
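
The steering intervention in the first entry of this list (adding a direction vector to residual stream activations) can be sketched as a simple vector addition. This is an illustration under stated assumptions: the hypothetical function `steer`, the scaling factor `alpha`, and the toy values are not from the source, and the direction is supplied by hand here rather than learned.

```python
def steer(residual, direction, alpha=1.0):
    """Add alpha * direction to the activation vector at every position.

    residual:  list of per-token activation vectors (toy values here)
    direction: steering vector with the same dimensionality
    alpha:     assumed scaling factor controlling intervention strength
    """
    return [[h + alpha * v for h, v in zip(vec, direction)]
            for vec in residual]

residual = [[1.0, 0.0], [0.5, -1.0]]  # two token positions, two dimensions
direction = [0.0, 2.0]                # hand-picked for illustration
steered = steer(residual, direction, alpha=0.5)
print(steered)  # → [[1.0, 1.0], [0.5, 0.0]]
```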