concept
active
concept:direction-based-steering

direction-based steering

Steering paradigm that seeks the right direction in activation space and shifts representations along it (e.g., linear steering).

Neighborhood — ranked by edge-count

Methods (1)

method
  • linear steering
    implements
    Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
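As a rough illustration of the linear steering baseline described above, the sketch below adds a scaled unit vector to a batch of hidden states. The source does not say how the direction is obtained, so the difference-of-means estimate here is an assumption, as are the function names, the synthetic data, and the value of `alpha`. It shows only the baseline that the paper criticizes, not the paper's own method.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Return v scaled to unit L2 norm."""
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return v / norm

def difference_of_means_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Assumed way to pick a direction: mean(pos) - mean(neg), normalized.

    pos, neg: arrays of shape (n_examples, d) holding activations collected
    with and without the trait of interest (hypothetical setup).
    """
    return unit(pos.mean(axis=0) - neg.mean(axis=0))

def linear_steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Linear steering: add alpha * direction to every row of hidden."""
    return hidden + alpha * direction

# Synthetic activations standing in for real model representations.
rng = np.random.default_rng(0)
d = 8
true_dir = unit(rng.normal(size=d))
pos = rng.normal(size=(64, d)) + 2.0 * true_dir
neg = rng.normal(size=(64, d)) - 2.0 * true_dir

direction = difference_of_means_direction(pos, neg)
hidden = rng.normal(size=(4, d))
steered = linear_steer(hidden, direction, alpha=3.0)

# Each row moves by exactly alpha along the (unit-norm) steering direction.
shift = (steered - hidden) @ direction
```

Because the same vector is added to every representation regardless of where that representation sits, the intervention is a uniform translation of activation space; that uniformity is the property the method description flags as mismatched with actual representation geometry.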

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Novel method that applies the intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
  • Ability to steer model behavior in two opposite semantic directions on a trait.
  • steering vectors · concept · 0.805
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
  • General technique of modifying activations to control model behavior.
  • Concept Steering · method · 0.796
    Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
  • Model Steering · concept · 0.796
    Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time.
  • At each step, choose the action that most intensifies the feeling of the emerging whole.
  • General approach of using interpretability feedback to steer model generation.