method
active
method:stepwise-steering
Stepwise steering
Novel method that applies the steering intervention only when the model begins a new thinking step (at the \n\n delimiter), rather than at every token.
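The contrast between stepwise and all-token steering can be sketched as a position mask over the generated tokens. This is a minimal illustration, not the implementation from the source paper: it assumes the \n\n delimiter arrives as a standalone token, that the intervention is an additive shift along a fixed direction, and that the shift is applied at the delimiter position itself. The names `steering_positions`, `apply_steering`, `direction`, and `strength` are hypothetical.

```python
from typing import List, Sequence

# Step boundary named in the description; assumed here to be a standalone token.
STEP_DELIMITER = "\n\n"


def steering_positions(tokens: Sequence[str], stepwise: bool = True) -> List[bool]:
    """Mark the token positions that receive the intervention.

    stepwise=True  -> only positions holding the step delimiter
    stepwise=False -> every position (the all-token baseline)
    """
    if stepwise:
        return [tok == STEP_DELIMITER for tok in tokens]
    return [True] * len(tokens)


def apply_steering(
    hidden: Sequence[Sequence[float]],
    direction: Sequence[float],
    strength: float,
    mask: Sequence[bool],
) -> List[List[float]]:
    """Add strength * direction to each hidden state whose mask entry is True."""
    return [
        [h + strength * d for h, d in zip(row, direction)] if m else list(row)
        for row, m in zip(hidden, mask)
    ]


# Toy data: 8 tokens with 2-dimensional hidden states, all zeros.
tokens = ["First", " step", "\n\n", "Wait", ",", " check", "\n\n", "Done"]
hidden = [[0.0, 0.0] for _ in tokens]
direction = [1.0, 0.0]

stepwise_out = apply_steering(hidden, direction, 2.0, steering_positions(tokens, True))
all_token_out = apply_steering(hidden, direction, 2.0, steering_positions(tokens, False))
```

In a real model the same mask would gate an activation edit inside the forward pass; real tokenizers may also merge the delimiter with neighboring characters, so the equality check here is a simplification.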
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- ReflCtrl · uses · The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering
Concepts (1)
concept
- Self-reflection · about · The ability of reasoning LLMs to review and revise previous reasoning steps during inference
Methods (1)
method
- All-token steering · extends · Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
Claims (1)
claim
- Performance is best when skipping both the first and last six layers when applying intervention · associated_with · Empirical configuration finding from ablation study on layer selection
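The layer-selection claim can be read as choosing a contiguous band of middle layers. The sketch below takes one reading of the claim (six layers skipped at each end) and exposes both counts as parameters, since the wording leaves the count for the first layers open. The function name `steered_layers` and the 32-layer example are hypothetical and not taken from the source.

```python
from typing import List


def steered_layers(num_layers: int, skip_first: int = 6, skip_last: int = 6) -> List[int]:
    """Return zero-based indices of the layers that receive the intervention.

    Skips `skip_first` layers at the bottom and `skip_last` layers at the top.
    Returns an empty list when the model is too shallow to leave any layer.
    """
    if num_layers < 0 or skip_first < 0 or skip_last < 0:
        raise ValueError("layer counts must be non-negative")
    return list(range(skip_first, num_layers - skip_last))


# Hypothetical 32-layer model: layers 6 through 25 are steered.
layers = steered_layers(32)
```

Keeping the two skip counts separate makes it easy to rerun a layer-selection ablation with other values.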
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Ability to steer model behavior in two opposite semantic directions on a trait.
- MAS variant applying interchange interventions at multiple contiguous token positions from the start of a sequence to a sampled time step t.
- Paradigm of finding the right geometry (manifold) for principled control.
- General technique of modifying activations to control model behavior.
- Design method: take small steps, deciding only what is known with certainty; reject guesses and large-scale trial-and-error.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time.