concept
active
concept:actadd-activation-addition
ActAdd (Activation Addition)
Method by Turner et al. for controlling model outputs at inference time via activation engineering; cited as the foundation for this paper's steering approach.
Neighborhood — ranked by edge-count
Thinkers (1)
thinker
- Turner et al. · introduces · Authors of the activation engineering / steering vectors work (Love vs. Hate prompt pairs)
Concepts (1)
concept
- steering vectors · implements · A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
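The two entries above describe the mechanism only in words, so a toy sketch may help. It assumes the steering vector is the difference between activations for a contrastive prompt pair (such as the Love vs. Hate pair) and that it is scaled and added to the leading token positions of a prompt's activations. The function names `steering_vector` and `add_steering`, the `coeff` and `positions` parameters, and all numeric values are hypothetical illustrations, not taken from Turner et al. or from this paper.

```python
from typing import List

Vector = List[float]


def steering_vector(pos_act: Vector, neg_act: Vector) -> Vector:
    """Difference of activations for a contrastive pair (e.g. "Love" minus "Hate")."""
    if len(pos_act) != len(neg_act):
        raise ValueError("activations must share a dimension")
    return [p - n for p, n in zip(pos_act, neg_act)]


def add_steering(
    activations: List[Vector],
    vec: Vector,
    coeff: float = 1.0,
    positions: int = 1,
) -> List[Vector]:
    """Return a copy of per-token activations with coeff * vec added
    to the first `positions` tokens (assumption: front-aligned injection)."""
    steered = [list(row) for row in activations]  # copy; input stays untouched
    for row in steered[:positions]:
        if len(row) != len(vec):
            raise ValueError("steering vector and activations must share a dimension")
        for i, v in enumerate(vec):
            row[i] += coeff * v
    return steered


# Toy activations standing in for one layer of a real model (made-up values).
love_act = [1.0, 0.0, 2.0]
hate_act = [0.0, 1.0, 2.0]
vec = steering_vector(love_act, hate_act)

prompt_acts = [[0.5, 0.5, 0.5], [0.1, 0.2, 0.3]]  # two token positions
steered = add_steering(prompt_acts, vec, coeff=2.0)
```

In a real model the same addition would typically be applied inside the forward pass at a chosen layer; the layer, coefficient, and token positions are hyperparameters that this sketch does not attempt to choose.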
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Steering method deriving vectors from contrastive prompt pairs and adding to first-token activations.
- Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Adding steering vector in forward direction to push model activations toward stronger reflective behavior.
- An existing activation steering method used as comparative baseline.
- Changing configuration to sample environment differently; minimizes free energy.
- Component of NLA that maps natural language explanations back to activations; truncated to first l layers of target model.
- The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.