method
active
method:reflection-enhancement-via-activation-addition
Reflection Enhancement via Activation Addition
Adds a steering vector in the forward direction to push model activations toward stronger reflective behavior.
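As a rough illustration of what adding a steering vector in the forward direction could look like, the sketch below shifts every token's activation along a fixed direction. It is not taken from the source: the function names `add_steering_vector` and `projection`, the scale parameter `alpha`, the unit-normalization of the direction, and the toy numbers are all assumptions made for the example.

```python
import math

def add_steering_vector(activations, steering_vector, alpha=1.0):
    """Shift each activation row by alpha times the unit steering direction.

    Assumption: the direction is normalized so alpha sets the step size.
    A positive alpha steers forward (toward the behavior).
    """
    norm = math.sqrt(sum(x * x for x in steering_vector))
    if norm == 0.0:
        raise ValueError("steering vector must be non-zero")
    unit = [x / norm for x in steering_vector]
    return [[a + alpha * u for a, u in zip(row, unit)] for row in activations]

def projection(row, direction):
    """Scalar projection of one activation row onto the direction."""
    norm = math.sqrt(sum(x * x for x in direction))
    return sum(a * d for a, d in zip(row, direction)) / norm

# Toy data: two token positions, hidden size 3 (hypothetical values).
hidden = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
reflect_dir = [3.0, 0.0, 4.0]  # stand-in for a learned "reflection" direction

steered = add_steering_vector(hidden, reflect_dir, alpha=2.0)

# Each row's projection onto the direction grows by exactly alpha.
for before, after in zip(hidden, steered):
    print(projection(after, reflect_dir) - projection(before, reflect_dir))
```

In a real model the same addition would typically be applied to residual stream activations at a chosen layer during the forward pass; which layer and which scale work best are left open here.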
Neighborhood — ranked by edge-count
Methods (2)
- Activation Addition (related_to): Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
- Activation Steering (implements): Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Applying reverse steering vector to suppress reflective behavior at inference time.
- Steering method deriving vectors from contrastive prompt pairs and adding to first-token activations.
- An existing activation steering method used as comparative baseline.
- Performance gains over CAA in steering tasks.
- Strategy using GPT-4o, Claude 3.5 Sonnet, and Gemini to generate additional responses preserving original meaning, targeting ≥1000 words concatenated per score category.
- Component of NLA that maps natural language explanations back to activations; truncated to first l layers of target model.
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Reflection level where explicit cue words (e.g., 'wait') prompt the model to inspect and revise reasoning.