concept
active
concept:preventative-steering-during-training

Preventative Steering During Training

An alternative to inference-time activation capping: persona steering is applied during training to anchor the model more deeply. Cited from Chen et al.

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • General technique of modifying activations to control model behavior.
  • Paradigm of finding the right direction in activation space (e.g., linear steering).
  • Ability to steer model behavior in two opposite semantic directions on a trait.
  • steering vectors · concept · 0.761
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
  • Novel method that applies the intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
  • Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
  • Paradigm of finding the right geometry (manifold) for principled control.
  • Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
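
The mean-difference technique described in the last neighbor above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from any cited work: the activations are synthetic NumPy arrays standing in for real model activations, and the names `mean_difference_vector`, `steer`, `alpha`, and `d_model` are hypothetical.

```python
import numpy as np


def mean_difference_vector(pos_acts, neg_acts):
    """Mean activation on one set of prompts minus the mean on the contrasting set.

    Both inputs have shape (n_prompts, d_model).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def steer(residual, vector, alpha=1.0):
    """Add alpha * vector to every position of a (seq_len, d_model) residual stream."""
    return residual + alpha * vector


# Synthetic stand-ins for activations (assumption: a real setup would read
# these from a chosen layer of the model on contrastive prompts).
rng = np.random.default_rng(0)
d_model = 8
trait_dir = np.zeros(d_model)
trait_dir[0] = 1.0  # pretend the trait lives along the first axis

pos_acts = rng.normal(size=(32, d_model)) + 2.0 * trait_dir
neg_acts = rng.normal(size=(32, d_model)) - 2.0 * trait_dir

v = mean_difference_vector(pos_acts, neg_acts)

# Apply the vector to a toy residual stream of 5 token positions.
residual = rng.normal(size=(5, d_model))
steered = steer(residual, v, alpha=0.5)
```

In this sketch a positive `alpha` pushes activations toward the trait and a negative one pushes away from it. The entity on this page differs in when the vector is applied (during training rather than at inference time); the sketch covers only the inference-time mechanics.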