framework
active
framework:activation-oracles

Activation Oracles

Framework for training LLMs to answer questions about externally provided activation vectors

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Karvonen, A.
    introduces
    Lead author of the Activation Oracles paper on training LLMs to explain activation vectors

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Supervised method that trains models to answer questions about activations; NLAs differ by being unsupervised.
  • Activations · concept · 0.843
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Activation Probing · concept · 0.757
    Technique of reading out model beliefs from internal activations before the final answer token is generated
  • Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
  • Oracle AI framework · framework · 0.750
    The view of AI as a question-answer system optimized for correctness, often inherited from supervised learning.
  • Latent model activations when processing inputs framed from another agent's perspective