Activations

Internal representations of the model on which probes operate; Probe-Based Data Attribution uses these activations to rank training datapoints.

Neighborhood — ranked by edge-count

paper

method

Probe-Based Data Attribution
cites
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
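As a rough illustration of how a linear probe over activations can rank datapoints, the sketch below scores each training example by its projection onto a probe direction and sorts by that score. It is a minimal sketch under stated assumptions, not the cited method's implementation: the probe here is a difference-of-means direction standing in for a trained linear classifier, and the toy activations and all function names are hypothetical.

```python
# Toy sketch: rank training datapoints by a linear probe score on activations.
# Assumption: a difference-of-means direction stands in for a trained linear
# classifier; all data and names below are hypothetical.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_mean_difference_probe(pos_acts, neg_acts):
    """Probe direction w = mean(pos_acts) - mean(neg_acts)."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def probe_score(w, act):
    """Linear probe score: dot product of direction and activation."""
    return sum(wi * ai for wi, ai in zip(w, act))

def rank_datapoints(w, train_acts):
    """Return (index, score) pairs, highest probe score first."""
    scored = [(i, probe_score(w, a)) for i, a in enumerate(train_acts)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Activations recorded on examples with and without the undesired behavior.
undesired = [[1.0, 0.1], [0.9, -0.1]]
benign = [[-1.0, 0.0], [-0.8, 0.2]]
w = fit_mean_difference_probe(undesired, benign)

# Activations for three candidate training datapoints.
train_acts = [[-0.5, 0.3], [0.7, 0.0], [0.1, 0.1]]
ranking = rank_datapoints(w, train_acts)
ranked_indices = [i for i, _ in ranking]
```

Datapoints at the top of `ranking` are the ones whose activations align most with the probe direction, which makes them the first candidates to inspect.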

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation Addition · method · 0.873
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
Activation Oracles · framework · 0.843
Framework training LLMs to answer questions about externally-provided activation vectors.
Activation space · concept · 0.826
Representation space on which linear probes operate to attribute harmful behaviors to training data.
Activation Oracles (AO) · method · 0.825
Supervised method training models to answer questions about activations; NLAs differ by being unsupervised.
Activation patching · method · 0.821
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Activation Compression · concept · 0.819
Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
Other-Referencing Activations · concept · 0.818
Latent model activations when processing inputs framed from another agent's perspective.
Activation Probing · concept · 0.811
Technique of reading out model beliefs from internal activations before the final answer token is generated.
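The listing above pairs a cosine threshold with the absence of a typed edge. A filter of that kind might look like the sketch below; the two-dimensional embeddings, the `typed_edges` set, and the function names are hypothetical, and only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to `target` and no typed edge to it,
    sorted from most to least similar."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked entities
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((name, sim))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real entity embeddings would be much larger.
embeddings = {
    "Activations": [1.0, 0.0],
    "Activation Addition": [0.9, 0.4],
    "Probe-Based Data Attribution": [1.0, 0.1],
    "Unrelated Entity": [0.0, 1.0],
}
# Entities that already have a typed relation to the target.
typed_edges = {"Probe-Based Data Attribution"}

candidates = untyped_neighbors("Activations", embeddings, typed_edges)
candidate_names = [name for name, _ in candidates]
```

In this toy example only `Activation Addition` survives the filter: the already-linked entity is excluded despite its high similarity, and the unrelated entity falls below the threshold.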