steering (intervention on internals)

General technique of modifying activations to control model behavior.

Neighborhood — ranked by edge-count

method

linear steering
implements
Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.

concept

Manifold Steering
implements
Central framework: steering neural networks by intervening along the curved manifold where a concept lives, rather than in straight lines through activation space.
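The contrast between the two neighbors above can be illustrated with a toy sketch. It assumes, purely for illustration, that a cyclic concept (seven "days") lies on a unit circle inside a known 2-D plane of activation space; the function names `linear_steer` and `circle_steer`, the `basis` argument, and the circular geometry are hypothetical and are not taken from the paper.

```python
import numpy as np

def linear_steer(h, v, alpha):
    """Linear steering: add a scaled steering vector (a straight-line move)."""
    return h + alpha * v

def circle_steer(h, basis, delta):
    """Toy manifold-style steering: rotate the part of h that lies in an
    assumed 2-D concept plane by angle delta, leaving the rest untouched.
    basis has shape (2, d) with orthonormal rows spanning that plane."""
    coords = basis @ h                       # 2-D coordinates inside the plane
    c, s = np.cos(delta), np.sin(delta)
    rot = np.array([[c, -s], [s, c]])
    return h + basis.T @ (rot @ coords - coords)

# Assumed toy geometry: 7 points on a unit circle in the first two of 4 dims.
d = 4
basis = np.eye(d)[:2]
angles = 2 * np.pi * np.arange(7) / 7
days = np.stack(
    [np.cos(angles), np.sin(angles), np.zeros(7), np.zeros(7)], axis=1
)

start, target = days[0], days[2]

# Halfway along the straight chord from start to target.
lin_mid = linear_steer(start, target - start, 0.5)
# Halfway along the arc from start to target.
man_mid = circle_steer(start, basis, 0.5 * (angles[2] - angles[0]))

radius_linear = np.linalg.norm(basis @ lin_mid)
radius_manifold = np.linalg.norm(basis @ man_mid)
```

In this toy setup the arc midpoint coincides with the intermediate point `days[1]`, while the chord midpoint falls inside the circle, away from every point on the assumed manifold. That is one way to picture the claimed mismatch between straight-line steering and curved representation geometry.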

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Principled Control via Intervention on Internals · concept · 0.828
The goal of mechanistically grounded, reliable control of neural network behavior via activation interventions.
direction-based steering · concept · 0.802
Paradigm of finding the right direction in activation space (e.g., linear steering).
Preventative Steering During Training · concept · 0.796
Alternative to inference-time activation capping: applying persona steering during training to deeply anchor models; cited from Chen et al.
Stepwise steering · method · 0.794
Novel method that applies the intervention only when the model begins a new thinking step (at the "\n\n" delimiter) rather than at every token.
Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. · claim · 0.785
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
Bidirectional Steering · concept · 0.782
Ability to steer model behavior in two opposite semantic directions on a trait.
steering vectors · concept · 0.779
A method for modifying model behavior by adding perturbation vectors to activations, used here in an attempt to reduce evaluation awareness.
Internal-state feedback steering is applicable to protein design and drug discovery beyond materials. · claim · 0.777
Generalizes the mechanism to other molecular design domains.
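The stepwise steering entry in the list above describes intervening only at step boundaries rather than at every token. A minimal sketch of that gating follows, assuming activations are available as a `(tokens, hidden_dim)` array alongside their token strings, and assuming the intervention is applied at the delimiter position itself. The helper name `stepwise_steer` and its parameters are hypothetical, not the method's actual interface.

```python
import numpy as np

def stepwise_steer(hidden, tokens, v, alpha, delimiter="\n\n"):
    """Add alpha * v only at positions whose token equals the step delimiter.

    hidden: array of shape (T, d), one activation row per token (assumed layout)
    tokens: list of T token strings
    Returns the steered copy and the boolean mask of steered positions.
    """
    hidden = np.asarray(hidden, dtype=float)
    mask = np.array([tok == delimiter for tok in tokens])
    steered = hidden.copy()
    steered[mask] += alpha * v   # all other positions are left unchanged
    return steered, mask

# Toy usage with made-up tokens and zero activations.
tokens = ["First", " step", "\n\n", "Second", " step", "\n\n", "Done"]
hidden = np.zeros((len(tokens), 3))
v = np.array([1.0, 0.0, 0.0])
steered, mask = stepwise_steer(hidden, tokens, v, alpha=2.0)
```

Here only the two delimiter positions receive the steering vector. Whether the intervention belongs at the delimiter or at the first token after it is not specified in the entry, so that choice is an assumption of this sketch.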