concept
active
concept:preventative-steering-during-training
Preventative Steering During Training
An alternative to inference-time activation capping: persona steering is applied during training, rather than only at inference, to anchor the model deeply. Cited from Chen et al.
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- General technique of modifying activations to control model behavior.
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Ability to steer model behavior in two opposite semantic directions on a trait.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Novel method that applies the intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
- Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- Paradigm of finding the right geometry (manifold) for principled control.
- Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
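The mean-difference technique in the last entry above, together with the two-direction steering mentioned earlier in the list, can be illustrated with a toy sketch. This is not the implementation from any cited paper: the function names (`mean_vector`, `steering_vector`, `steer`), the scaling coefficient `alpha`, and the two-dimensional "activations" are all assumptions made for illustration, and plain Python lists stand in for real residual-stream tensors.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean over a batch of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Mean activation on trait-positive prompts minus mean on trait-negative prompts."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(residual: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to a residual-stream activation.

    alpha is a hypothetical strength knob: positive values push toward
    the trait, negative values push away from it.
    """
    return [h + alpha * d for h, d in zip(residual, direction)]


# Toy activations (assumed values, not real model data).
pos = [[1.0, 2.0], [3.0, 4.0]]  # activations on trait-positive prompts
neg = [[0.0, 1.0], [2.0, 1.0]]  # activations on trait-negative prompts

v = steering_vector(pos, neg)
amplified = steer([0.5, 0.5], v, alpha=2.0)
suppressed = steer([0.5, 0.5], v, alpha=-2.0)
```

In a real model the same addition would be applied to a chosen layer's residual stream, typically through a forward hook. Applying it during training forward passes instead of at inference is the variation this concept describes.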