Activation Oracles

Framework for training LLMs to answer questions about externally provided activation vectors

Neighborhood — ranked by edge-count

paper

thinker

Karvonen, A.
introduces
Lead author of the Activation Oracles paper, which trains LLMs to explain activation vectors

hypothesis

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation Oracles (AO) · method · 0.918
Supervised method training models to answer questions about activations; NLAs differ by being unsupervised.
Activations · concept · 0.843
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Activation Addition · method · 0.770
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Activation patching · method · 0.760
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Activation Probing · concept · 0.757
Technique of reading out model beliefs from internal activations before the final answer token is generated
Activation Steering · method · 0.751
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
Oracle AI framework · framework · 0.750
The view of AI as a question-answer system optimized for correctness, often inherited from supervised learning.
Other-Referencing Activations · concept · 0.748
Latent model activations when processing inputs framed from another agent's perspective
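
The candidate rule stated above (cosine ≥ 0.65 and no typed edge) can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Entity` class, the `untyped_neighbors` function, the embedding representation, and the way typed edges are stored are all hypothetical, and the page does not say how its scores are actually computed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record: a name plus an embedding vector."""
    name: str
    embedding: tuple[float, ...]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, others, typed_edge_names, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but have no typed relation to it, highest score first.

    `typed_edge_names` is assumed to be the set of entity names that already
    share a typed edge with `target`.
    """
    candidates = []
    for entity in others:
        if entity.name == target.name or entity.name in typed_edge_names:
            continue  # skip the entity itself and anything already linked
        score = cosine(target.embedding, entity.embedding)
        if score >= threshold:
            candidates.append((entity.name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for a new typed edge or a possible duplicate of the target; deciding which still needs a manual or downstream check.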