method
active
method:stepwise-steering

Stepwise steering

Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token
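As a rough illustration of the difference between the two schedules, the sketch below builds a position mask over a fixed token sequence and adds a steering vector only where the mask is set. It assumes that "at the \n\n delimiter" means the position of the delimiter token itself; the description does not say whether the intervention lands on the delimiter or on the first token after it. The names `steering_mask`, `apply_steering`, `direction`, and `strength` are hypothetical, and the hidden states are a toy NumPy array rather than activations from a real model.

```python
import numpy as np

STEP_DELIMITER = "\n\n"  # step delimiter named in the description


def steering_mask(tokens, stepwise=True):
    """Boolean mask of positions where the steering vector is added.

    stepwise=False: every position (the every-token baseline).
    stepwise=True:  only delimiter positions (assumption, see lead-in).
    """
    if not stepwise:
        return np.ones(len(tokens), dtype=bool)
    return np.array([tok == STEP_DELIMITER for tok in tokens], dtype=bool)


def apply_steering(hidden, tokens, direction, strength, stepwise=True):
    """Add strength * direction to hidden states at the masked positions.

    hidden:    (seq_len, d_model) array of toy hidden states
    direction: (d_model,) steering vector (hypothetical)
    """
    mask = steering_mask(tokens, stepwise)
    steered = hidden.copy()  # leave the input untouched
    steered[mask] += strength * direction
    return steered


# Toy sequence with two step boundaries.
tokens = ["First", " step", "\n\n", "Wait", ",", " check", "\n\n", "Done"]
hidden = np.zeros((len(tokens), 4))
direction = np.array([1.0, 0.0, 0.0, 0.0])

stepwise_out = apply_steering(hidden, tokens, direction, 2.0, stepwise=True)
baseline_out = apply_steering(hidden, tokens, direction, 2.0, stepwise=False)
```

In this toy setup the stepwise schedule modifies two of the eight positions, while the baseline modifies all eight. In an actual autoregressive model the same mask would more plausibly be evaluated inside a per-token generation hook, which this offline sketch does not attempt to reproduce.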

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering

Concepts (1)

concept
  • The ability of reasoning LLMs to review and revise previous reasoning steps during inference

Methods (1)

method
  • Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Paradigm of finding the right direction in activation space (e.g., linear steering).
  • Ability to steer model behavior in two opposite semantic directions on a trait.
  • Stepwise MAS · method · 0.800
    MAS variant applying interchange interventions at multiple contiguous token positions from the start of a sequence to a sampled time step t.
  • Paradigm of finding the right geometry (manifold) for principled control.
  • General technique of modifying activations to control model behavior.
  • Design method: take small steps, deciding only what is known with certainty; reject guesses and large-scale trial-and-error.
  • steering vectors · concept · 0.778
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
  • Model Steering · concept · 0.777
    Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time