method
active
method:activation-addition

Activation Addition

Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
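
The description above can be illustrated with a minimal NumPy sketch. It assumes the direction vector has already been obtained by some means and only shows the addition step: the vector, multiplied by a scale, is added to the residual-stream activations at every token position. The function name `add_steering_vector`, the `scale` parameter, and the toy shapes are hypothetical choices for illustration, not part of any specific library or of the source method's reference implementation.

```python
import numpy as np


def add_steering_vector(resid, direction, scale=1.0):
    """Return resid + scale * direction, broadcast over token positions.

    resid:     array of shape (seq_len, d_model), toy residual-stream activations
    direction: array of shape (d_model,), the steering direction (assumed given)
    scale:     hypothetical coefficient controlling intervention strength
    """
    resid = np.asarray(resid, dtype=float)
    direction = np.asarray(direction, dtype=float)
    if direction.shape != resid.shape[-1:]:
        raise ValueError("direction must have shape (d_model,)")
    # Broadcasting adds the same vector to every row (token position).
    return resid + scale * direction


# Toy data standing in for real model activations (assumption: 4 tokens, d_model = 8).
rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 8))
direction = rng.normal(size=8)
unit = direction / np.linalg.norm(direction)  # normalise so scale has a clear meaning

steered = add_steering_vector(resid, unit, scale=3.0)

# Projection of the change onto the unit direction equals the scale at every position.
shift = (steered - resid) @ unit
```

In a real model the same addition would typically be applied inside a forward hook at a chosen layer; which layer and which scale to use are empirical choices that this sketch does not address.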

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors

Methods (1)

method

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Activations · concept · 0.873
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Steering method deriving vectors from contrastive prompt pairs and adding to first-token activations.
  • Method by Turner et al. for real-time output control via activation engineering, cited as foundation for this paper's steering approach
  • Base-10 addition · concept · 0.809
    The generic addition mechanism that Llama-3.1-8B actually uses to compute sums before mapping back to cyclic concept space
  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Modular addition · concept · 0.801
    The mathematically natural computation for cyclic concepts (e.g., addition mod 12 for months), which the paper shows Llama does NOT directly implement
  • Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset
  • Latent model activations when processing inputs framed from another agent's perspective