concept
active
concept:model-steering

Model Steering

Using interventions to guide a model's generation behavior, e.g., adding sentiment vectors at inference time.
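As a rough illustration of the idea, the sketch below adds a scaled direction vector to a single activation vector. It is a toy example under stated assumptions, not a specific library's API: the function name `steer`, the scaling factor `alpha`, and the three-dimensional "sentiment" direction are all hypothetical, and a real model would apply the same arithmetic to a hidden state inside the network.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * direction, computed element-wise.

    `hidden` stands in for one activation vector and `direction` for a
    steering (e.g. sentiment) vector; both names are illustrative.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical toy values; real activations have thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
sentiment_direction = [1.0, 0.0, -1.0]

# A positive alpha pushes along the direction, a negative alpha pushes against it.
more_positive = steer(hidden, sentiment_direction, alpha=2.0)
more_negative = steer(hidden, sentiment_direction, alpha=-2.0)
```

The sign of `alpha` selects the steering direction and its magnitude controls the strength of the intervention; both would normally be tuned empirically.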

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • model · concept · 0.807
    A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
  • Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
  • steering vectors · concept · 0.800
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
  • Paradigm of finding the right direction in activation space (e.g., linear steering).
  • The method can steer the model in both positive and negative directions on the target semantic.
  • Concept Steering · method · 0.790
    Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
  • model selection · concept · 0.787
    Comparing models using log-evidence approximated by free energy.
  • Novel method that applies the intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
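
The last entry in the list above describes gating the intervention on step boundaries. A minimal sketch of that gating logic follows, assuming tokens are available as strings and that the intervention is applied to the activation at the `\n\n` delimiter position itself (the source does not say whether it is that position or the one after). The function name `steer_at_step_starts` and all values are hypothetical.

```python
from typing import List

STEP_DELIMITER = "\n\n"  # marks the start of a new thinking step


def steer_at_step_starts(
    tokens: List[str],
    hiddens: List[List[float]],
    direction: List[float],
    alpha: float,
) -> List[List[float]]:
    """Add alpha * direction only at delimiter positions; copy the rest unchanged."""
    steered = []
    for tok, h in zip(tokens, hiddens):
        if tok == STEP_DELIMITER:
            # Assumption: the activation at the delimiter is the one that is steered.
            steered.append([x + alpha * d for x, d in zip(h, direction)])
        else:
            steered.append(list(h))
    return steered


# Toy sequence with one step boundary and zero-valued activations.
tokens = ["First", " idea", "\n\n", "Second", " idea"]
hiddens = [[0.0, 0.0] for _ in tokens]
direction = [1.0, -1.0]

result = steer_at_step_starts(tokens, hiddens, direction, alpha=0.5)
```

Only the third position is modified here; every other activation passes through untouched, which is the contrast with steering at every token.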