framework
active
framework:activation-oracles
Activation Oracles
Framework for training LLMs to answer questions about externally provided activation vectors
Neighborhood — ranked by edge-count
Papers (1)
paper
Thinkers (1)
thinker
- Karvonen, A. · introduces · Lead author of the Activation Oracles paper, training LLMs to explain activation vectors
Hypotheses (1)
hypothesis
- Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Supervised method training models to answer questions about activations; NLAs differ by being unsupervised.
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
- Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
- Technique of reading out model beliefs from internal activations before the final answer token is generated
- Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- The view of AI as a question-answer system optimized for correctness, often inherited from supervised learning.
- Latent model activations when processing inputs framed from another agent's perspective