concept
active
concept:steering-intervention-on-internals
steering (intervention on internals)
General technique of modifying activations to control model behavior.
Neighborhood — ranked by edge-count
Methods (1)
method
- linear steering · implements · Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
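As a rough illustration of the linear steering entry above, the sketch below adds a scaled steering vector to a single hidden-state vector. The function name `linear_steer`, the parameter `alpha`, and the toy three-dimensional vectors are assumptions made for this example; they are not taken from the paper or from any library.

```python
from typing import List


def linear_steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * direction, computed element-wise.

    `hidden` stands in for one activation vector and `direction` for a
    steering vector; both names are hypothetical.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy values chosen only for illustration.
hidden = [1.0, 0.0, -2.0]
direction = [0.0, 1.0, 0.0]
steered = linear_steer(hidden, direction, alpha=4.0)
print(steered)
```

In a real model the same addition would typically be applied to a layer's activations during the forward pass; the straight-line nature of the update is what the entry describes as mismatched with representation geometry.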
Concepts (1)
concept
- Manifold Steering · implements · Central framework: steering neural networks by intervening along the curved manifold where a concept lives, rather than in straight lines through activation space.
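The contrast in the Manifold Steering entry can be pictured with a toy case in which a concept lives on a circle in a two-dimensional activation space. This is only a sketch under that assumption, not the paper's method: the circle, the function names `steer_linear` and `steer_along_circle`, and the interpolation fraction `t` are all hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def steer_linear(point: Point, target: Point, t: float) -> Point:
    """Move a fraction t along the straight line (chord) from point to target."""
    return (point[0] + t * (target[0] - point[0]),
            point[1] + t * (target[1] - point[1]))


def steer_along_circle(point: Point, target: Point, t: float) -> Point:
    """Move a fraction t along the arc from point to target, keeping the radius."""
    a0 = math.atan2(point[1], point[0])
    a1 = math.atan2(target[1], target[0])
    # Wrap the angular difference into [-pi, pi) so the shorter arc is used.
    delta = (a1 - a0 + math.pi) % (2 * math.pi) - math.pi
    radius = math.hypot(point[0], point[1])
    angle = a0 + t * delta
    return (radius * math.cos(angle), radius * math.sin(angle))


start, goal = (1.0, 0.0), (0.0, 1.0)
lin_mid = steer_linear(start, goal, 0.5)        # cuts through the interior
arc_mid = steer_along_circle(start, goal, 0.5)  # stays on the circle
print(math.hypot(*lin_mid), math.hypot(*arc_mid))
```

Halfway through, the straight-line path has a norm well below 1 and so has left the circle, while the arc path keeps norm 1. That is the intuition behind intervening along the manifold rather than in a straight line.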
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The goal of mechanistically-grounded, reliable control of neural network behavior via activation interventions
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Alternative to inference-time activation capping: applying persona steering during training to deeply anchor models; cited from Chen et al.
- Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token
- The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
- Ability to steer model behavior in two opposite semantic directions on a trait.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Generalizes the mechanism to other molecular design domains.
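One entry in the list above describes applying an intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token. A minimal sketch of that gating follows, assuming the delimiter arrives as a single token and that the intervention is applied at the delimiter position itself; the entry does not specify either detail, and the names `STEP_DELIMITER` and `intervention_mask` are hypothetical.

```python
from typing import List

# Assumption: the step delimiter appears as one token consisting of two newlines.
STEP_DELIMITER = "\n\n"


def intervention_mask(tokens: List[str]) -> List[bool]:
    """Return True at positions where the intervention would be applied.

    Only delimiter tokens are flagged; every other token is left untouched.
    """
    return [tok == STEP_DELIMITER for tok in tokens]


# Toy token sequence for illustration only.
tokens = ["First", " step", "\n\n", "Second", " step", "\n\n", "Done"]
mask = intervention_mask(tokens)
positions = [i for i, on in enumerate(mask) if on]
print(positions)
```

An every-token variant would correspond to a mask that is True everywhere; the gated version touches far fewer positions.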