ActAdd (Activation Addition)

Method by Turner et al. for real-time output control via activation engineering, cited as the foundation for this paper's steering approach.

Neighborhood — ranked by edge-count

thinker

Turner et al.
introduces
Authors of activation engineering/steering vectors work (Love vs. Hate prompt pairs)

concept

steering vectors
implements
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
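
The idea of adding a perturbation vector to activations can be illustrated with a minimal sketch. It assumes toy three-dimensional activations in place of a real model's forward pass. The names steering_vector, add_steering, and coeff are hypothetical and are not taken from Turner et al. The choice of layer and token positions is left out, and the vector is simply added at every position.

```python
from typing import List

Vector = List[float]


def steering_vector(act_pos: Vector, act_neg: Vector, coeff: float) -> Vector:
    """Scaled difference between activations for a contrastive prompt pair."""
    if len(act_pos) != len(act_neg):
        raise ValueError("activations must have the same dimension")
    return [coeff * (p - n) for p, n in zip(act_pos, act_neg)]


def add_steering(activations: List[Vector], steer: Vector) -> List[Vector]:
    """Add the steering vector to each position's activation; returns a new list."""
    return [[a + s for a, s in zip(row, steer)] for row in activations]


# Toy stand-ins for activations of a contrastive pair such as "Love" vs. "Hate".
# Real activations would be read from a chosen layer of the model (assumption).
act_love = [1.0, 0.0, 2.0]
act_hate = [0.0, 1.0, 2.0]
steer = steering_vector(act_love, act_hate, coeff=2.0)

# Toy activations for a two-token prompt, one row per token position.
prompt_acts = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
steered = add_steering(prompt_acts, steer)
```

The coefficient controls how strongly the direction is injected. In this sketch a negative coeff would push activations toward the second prompt of the pair instead of the first.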

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation Addition (ActAdd) · framework · 0.955
Steering method that derives vectors from contrastive prompt pairs and adds them to first-token activations.
Activation Addition · method · 0.820
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Activations · concept · 0.762
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Reflection Enhancement via Activation Addition · method · 0.755
Adding steering vector in forward direction to push model activations toward stronger reflective behavior.
Contrastive Activation Addition (CAA) · method · 0.745
An existing activation steering method used as comparative baseline.
action · concept · 0.729
Changing configuration to sample environment differently; minimizes free energy.
Activation Reconstructor (AR) · method · 0.729
Component of NLA that maps natural language explanations back to activations; truncated to first l layers of target model.
Activation decomposition · concept · 0.729
The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.