Activation Addition

Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
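
As a rough illustration of the description above, the following minimal sketch adds a scaled direction vector to every token position of a toy residual stream. The function name `add_steering_vector`, the `scale` parameter, and the toy values are assumptions made for illustration; they are not taken from any source paper or library, and a real setup would apply the same addition to a model's hidden states at a chosen layer.

```python
from typing import List

Vector = List[float]


def add_steering_vector(
    residual: List[Vector], direction: Vector, scale: float = 1.0
) -> List[Vector]:
    """Return a copy of `residual` with scale * direction added at each position.

    `residual` stands in for residual stream activations, one vector per token.
    `direction` is the steering direction (hypothetical values below).
    """
    if any(len(row) != len(direction) for row in residual):
        raise ValueError("direction must match the hidden size of every position")
    return [[h + scale * d for h, d in zip(row, direction)] for row in residual]


# Toy residual stream: 2 token positions, hidden size 3 (illustrative numbers).
residual = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
direction = [0.0, 1.0, 0.0]

# A positive scale pushes activations along the direction; a negative one pushes away.
steered = add_steering_vector(residual, direction, scale=2.0)
print(steered)
```

The sign and magnitude of `scale` control how strongly, and in which sense, the activations are moved along the direction.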

Neighborhood — ranked by edge-count

framework

Concept Cones
uses
The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors

method

Reflection Enhancement via Activation Addition
related_to
Adds the steering vector in the forward direction to push model activations toward stronger reflective behavior.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activations · concept · 0.873
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Activation Addition (ActAdd) · framework · 0.871
Steering method that derives vectors from contrastive prompt pairs and adds them to first-token activations.
ActAdd (Activation Addition) · concept · 0.820
Method by Turner et al. for real-time output control via activation engineering, cited as the foundation for this paper's steering approach.
Base-10 addition · concept · 0.809
The generic addition mechanism that Llama-3.1-8B actually uses to compute sums before mapping back to cyclic concept space.
Activation patching · method · 0.801
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Modular addition · concept · 0.801
The mathematically natural computation for cyclic concepts (e.g., addition mod 12 for months), which the paper shows Llama does NOT directly implement.
Activation Similarity · concept · 0.789
Model-independent feature comparison based on correlating activation vectors across a fixed, diverse dataset.
Other-Referencing Activations · concept · 0.787
Latent model activations when processing inputs framed from another agent's perspective.