Stepwise steering

Novel method that applies the steering intervention only when the model begins a new thinking step (at the \n\n delimiter), rather than at every generated token.
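The difference between stepwise and all-token application can be illustrated with a small sketch. This is not the ReflCtrl implementation. It assumes that the \n\n delimiter appears as a single token, that the intervention is applied at the delimiter's own position, and that the intervention is a simple addition of a scaled direction vector. The names `steering_positions`, `steer`, `direction`, and `strength` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence

STEP_DELIMITER = "\n\n"  # step boundary named in the description above


def steering_positions(tokens: Sequence[str], stepwise: bool = True) -> List[int]:
    """Return the token positions at which the intervention would be applied.

    stepwise=True  -> only positions whose token is the step delimiter
                      (assumption: the delimiter is a single token)
    stepwise=False -> every position (the all-token baseline)
    """
    if not stepwise:
        return list(range(len(tokens)))
    return [i for i, tok in enumerate(tokens) if tok == STEP_DELIMITER]


def steer(hidden: List[List[float]], direction: Sequence[float],
          strength: float, positions: Sequence[int]) -> List[List[float]]:
    """Add strength * direction to the hidden state at each selected position.

    Additive steering is an assumption made for this sketch; unselected
    positions are copied through unchanged.
    """
    chosen = set(positions)
    return [
        [h + strength * d for h, d in zip(vec, direction)] if i in chosen else list(vec)
        for i, vec in enumerate(hidden)
    ]


# Toy sequence with two step boundaries and 2-dimensional dummy activations.
tokens = ["First", " idea", "\n\n", "Wait", ",", " check", "\n\n", "Done"]
hidden = [[0.0, 0.0] for _ in tokens]
direction = [1.0, -1.0]

stepwise = steering_positions(tokens, stepwise=True)
all_token = steering_positions(tokens, stepwise=False)
steered = steer(hidden, direction, strength=0.5, positions=stepwise)

print("stepwise positions:", stepwise)
print("all-token positions:", all_token)
```

In this toy sequence the stepwise variant touches two of the eight positions while the all-token baseline touches every one, so the stepwise variant intervenes far less often.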

Neighborhood — ranked by edge-count

paper

framework

ReflCtrl
uses
The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering

concept

Self-reflection
about
The ability of reasoning LLMs to review and revise previous reasoning steps during inference

method

All-token steering
extends
Baseline steering method that applies the intervention at every token-generation step; shown to degrade performance at high steering strengths

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

direction-based steering · concept · 0.837
Paradigm of finding the right direction in activation space (e.g., linear steering).
Bidirectional Steering · concept · 0.811
Ability to steer model behavior in two opposite semantic directions on a trait.
Stepwise MAS · method · 0.800
MAS variant applying interchange interventions at multiple contiguous token positions from the start of a sequence to a sampled time step t.
geometry-based steering · concept · 0.794
Paradigm of finding the right geometry (manifold) for principled control.
steering (intervention on internals) · concept · 0.794
General technique of modifying activations to control model behavior.
Moving with certainty (stepwise decision) · method · 0.784
Design method: take small steps, deciding only what is known with certainty; reject guesses and large-scale trial-and-error.
steering vectors · concept · 0.778
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Model Steering · concept · 0.777
Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time.