Preventative Steering During Training

An alternative to inference-time activation capping: persona steering is applied during training to anchor the model more deeply. Cited from Chen et al.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

steering (intervention on internals) · concept · 0.796
General technique of modifying activations to control model behavior.
direction-based steering · concept · 0.777
Paradigm of finding the right direction in activation space (e.g., linear steering).
Bidirectional Steering · concept · 0.773
Ability to steer model behavior in two opposite semantic directions on a trait.
steering vectors · concept · 0.761
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Stepwise steering · method · 0.758
Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
Activation Steering · method · 0.757
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
geometry-based steering · concept · 0.755
Paradigm of finding the right geometry (manifold) for principled control.
Contrastive Activation Steering · method · 0.751
Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
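
The Contrastive Activation Steering entry above describes a mean-difference computation that is easy to illustrate. The following is a minimal sketch only, assuming activations are available as plain lists of floats; the function names (mean_vector, contrastive_steering_vector, apply_steering) and the scale parameter alpha are hypothetical and not taken from any cited work or library.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty batch of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def contrastive_steering_vector(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> Vector:
    """Mean activation on 'positive' prompts minus mean activation on 'negative' prompts."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def apply_steering(residual: Sequence[float], steering: Sequence[float], alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to one residual-stream activation.

    alpha is an assumed scaling knob; a negative alpha steers in the opposite direction.
    """
    return [h + alpha * v for h, v in zip(residual, steering)]


# Toy activations standing in for residual-stream states at a single layer.
positive_acts = [[1.0, 2.0], [3.0, 4.0]]
negative_acts = [[0.0, 0.0], [2.0, 2.0]]

steer = contrastive_steering_vector(positive_acts, negative_acts)
steered = apply_steering([0.5, 0.5], steer, alpha=2.0)
```

In a real model the addition would typically happen inside a hook on a chosen layer's output; the toy lists here only show the arithmetic.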