concept
active
concept:actadd-activation-addition

ActAdd (Activation Addition)

Method by Turner et al. for real-time output control via activation engineering, cited as the foundation for this paper's steering approach.
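
As a rough illustration of the idea, the sketch below builds a steering vector as the difference between activations for a contrastive prompt pair and adds a scaled copy of it to selected token positions. It is a toy, pure-Python example on hand-written numbers, not Turner et al.'s implementation: the function names, the coefficient, and the choice of positions are all assumptions made for illustration, and in a real model the activations would be read from and written to a chosen layer during the forward pass.

```python
from typing import List

Vector = List[float]


def steering_vector(pos_act: Vector, neg_act: Vector) -> Vector:
    """Difference between activations for a contrastive prompt pair.

    Hypothetical helper: pos_act / neg_act stand in for activations recorded
    at the same layer for, e.g., a "Love" prompt and a "Hate" prompt.
    """
    if len(pos_act) != len(neg_act):
        raise ValueError("activations must have the same dimension")
    return [p - n for p, n in zip(pos_act, neg_act)]


def add_steering(
    activations: List[Vector],
    vector: Vector,
    coeff: float,
    positions: List[int],
) -> List[Vector]:
    """Return a copy of per-token activations with coeff * vector added
    at the given token positions; other positions are left unchanged."""
    steered = [list(row) for row in activations]
    for pos in positions:
        if len(steered[pos]) != len(vector):
            raise ValueError("vector dimension does not match activations")
        steered[pos] = [a + coeff * v for a, v in zip(steered[pos], vector)]
    return steered


# Toy numbers standing in for real model activations (assumption).
vec = steering_vector([1.0, 0.0, 2.0], [0.0, 1.0, 2.0])
resid = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]  # two token positions, dim 3
steered = add_steering(resid, vec, coeff=2.0, positions=[0])
```

Here only position 0 is modified and the input list is not mutated. Which layer, positions, and coefficient to use are tuning choices that this sketch does not settle.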

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Turner et al.
    introduces
    Authors of activation engineering/steering vectors work (Love vs. Hate prompt pairs)

Concepts (1)

concept
  • A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Steering method deriving vectors from contrastive prompt pairs and adding to first-token activations.
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
  • Activations · concept · 0.762
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Adding steering vector in forward direction to push model activations toward stronger reflective behavior.
  • An existing activation steering method used as comparative baseline.
  • action · concept · 0.729
    Changing configuration to sample environment differently; minimizes free energy.
  • Component of NLA that maps natural language explanations back to activations; truncated to first l layers of target model.
  • The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.