concept
active
concept:bidirectional-steering
Bidirectional Steering
Ability to steer model behavior in either of two opposite semantic directions along a trait.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Functional Faithfulness · associated_with · Empirical effect where intervening on one feature induces coherent shifts across multiple linguistic dimensions aligned with the target attribute.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The method can steer the model in both positive and negative directions on the target semantic.
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
- Extension of manifold steering validation to video world models and physical dynamics tasks, demonstrating cross-modal generality.
- General technique of modifying activations to control model behavior.
- Paradigm of finding the right geometry (manifold) for principled control.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
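The mean-difference technique and the two-direction steering described in the entries above can be illustrated with a minimal sketch. It assumes activations are plain lists of floats and uses toy values; the function names (`mean_difference_vector`, `steer`, `projection`) and the coefficient `alpha` are hypothetical names chosen for illustration, not taken from any paper or library referenced on this page.

```python
from statistics import fmean


def mean_difference_vector(pos_acts, neg_acts):
    """Mean activation on positive prompts minus mean activation on negative prompts."""
    dim = len(pos_acts[0])
    pos_mean = [fmean(a[i] for a in pos_acts) for i in range(dim)]
    neg_mean = [fmean(a[i] for a in neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(residual, direction, alpha):
    """Add alpha * direction to one residual-stream vector.

    alpha > 0 pushes toward the positive side of the trait,
    alpha < 0 pushes toward the negative side.
    """
    return [r + alpha * d for r, d in zip(residual, direction)]


def projection(vec, direction):
    """Dot product of vec with direction, used here as a crude trait score."""
    return sum(v * d for v, d in zip(vec, direction))


# Toy activations (assumed values, 3-dimensional for readability).
pos_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # contrastive prompts: trait present
neg_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]  # contrastive prompts: trait absent

direction = mean_difference_vector(pos_acts, neg_acts)
residual = [0.5, 1.0, -1.0]

toward = steer(residual, direction, alpha=1.0)   # steer in the positive direction
away = steer(residual, direction, alpha=-1.0)    # steer in the negative direction

print(direction, toward, away)
```

In a real model the same addition would be applied to the residual stream at a chosen layer during inference; the sign of `alpha` is what makes the intervention bidirectional.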