method
active
method:activation-addition
Activation Addition
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
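The core operation can be illustrated with a minimal sketch. It assumes the residual stream is available as one vector per token position and that a direction vector has already been obtained; the function name `add_steering_vector`, the `scale` and `positions` parameters, and the toy values are hypothetical and not taken from any cited paper. Plain Python lists are used only to keep the sketch dependency-free; a real implementation would typically operate on framework tensors inside a forward hook.

```python
from typing import List, Optional

Vector = List[float]
Activations = List[Vector]  # one residual-stream vector per token position


def add_steering_vector(
    residual: Activations,
    direction: Vector,
    scale: float = 1.0,
    positions: Optional[List[int]] = None,
) -> Activations:
    """Return a copy of `residual` with `scale * direction` added at the chosen positions.

    `positions=None` steers every token position (an assumption of this sketch).
    """
    if any(len(row) != len(direction) for row in residual):
        raise ValueError("direction must match the residual stream width")
    targets = set(range(len(residual))) if positions is None else set(positions)
    return [
        # Add the scaled direction at targeted positions; copy the rest unchanged.
        [x + scale * d for x, d in zip(row, direction)] if i in targets else list(row)
        for i, row in enumerate(residual)
    ]


# Toy example: 3 token positions, residual width 4 (all values are made up).
residual = [
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
direction = [1.0, 0.0, -1.0, 0.0]

# Steer only the first token position, with a hypothetical strength of 2.0.
steered = add_steering_vector(residual, direction, scale=2.0, positions=[0])
```

In this sketch a positive `scale` pushes activations along the direction and a negative `scale` pushes them the opposite way; the input activations are left unmodified so the steered and unsteered runs can be compared.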
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Concept Cones (uses): The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
Methods (1)
method
- Adding a steering vector in the forward direction to push model activations toward stronger reflective behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Steering method that derives vectors from contrastive prompt pairs and adds them to first-token activations.
- Method by Turner et al. for real-time output control via activation engineering, cited as foundation for this paper's steering approach
- The generic addition mechanism that Llama-3.1-8B actually uses to compute sums before mapping back to cyclic concept space
- Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
- The mathematically natural computation for cyclic concepts (e.g., addition mod 12 for months), which the paper shows Llama does NOT directly implement
- Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset
- Latent model activations when processing inputs framed from another agent's perspective