method
active
method:contrastive-steering-vector-construction
Contrastive Steering Vector Construction
Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
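A minimal sketch of the mean-difference computation described above, assuming the activations at the chosen layer have already been collected into arrays of shape (n_prompts, hidden_dim), one array per reflection level. The function name `contrastive_steering_vector`, the array names, and the toy data are hypothetical and not taken from the source.

```python
import numpy as np

def contrastive_steering_vector(acts_high, acts_low):
    """Mean activation difference between two reflection levels at one layer.

    acts_high, acts_low: arrays of shape (n_prompts, hidden_dim) holding
    activations collected at the same layer (hypothetical inputs).
    """
    acts_high = np.asarray(acts_high, dtype=np.float64)
    acts_low = np.asarray(acts_low, dtype=np.float64)
    if acts_high.shape[1] != acts_low.shape[1]:
        raise ValueError("hidden dimensions must match")
    # Average over prompts first, then subtract the two class means.
    return acts_high.mean(axis=0) - acts_low.mean(axis=0)

# Toy data (assumption): the "high" level is shifted along one known direction.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0])
acts_low = rng.normal(size=(64, 4))
acts_high = rng.normal(size=(64, 4)) + 2.0 * direction
vec = contrastive_steering_vector(acts_high, acts_low)
```

Averaging over many prompts before subtracting is what damps prompt-specific noise; with a single pair the result reduces to a plain activation difference.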
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Pairs of prompts at different reflection levels used to compute steering vectors.
Methods (1)
method
- Activation Steering · implements · Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Demonstrates that averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
- Method for obtaining concept vectors by subtracting activations from two contrasting prompts.
- Steering vector extracted in Experiment 2 that captures a latent representation of the desired role behavior and honesty semantics.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Ability to steer model behavior in two opposite semantic directions on a trait.
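
Several of the entries above describe adding a steering vector to the residual stream at inference time and steering in two opposite directions. A minimal sketch of that step, assuming the residual stream at the target layer is available as an array of shape (seq_len, hidden_dim). The names `apply_steering` and `alpha` are hypothetical, and hooking this into an actual model is left out.

```python
import numpy as np

def apply_steering(resid, vec, alpha):
    """Add a scaled steering vector to every position of a residual stream.

    resid: array of shape (seq_len, hidden_dim) (hypothetical input).
    vec:   steering vector of shape (hidden_dim,).
    alpha: signed strength; the sign selects the steering direction.
    """
    resid = np.asarray(resid, dtype=np.float64)
    vec = np.asarray(vec, dtype=np.float64)
    if resid.shape[-1] != vec.shape[0]:
        raise ValueError("hidden dimensions must match")
    # Broadcasting adds the same offset at every sequence position.
    return resid + alpha * vec

# Toy example (assumption): a zero residual stream makes the offset easy to see.
resid = np.zeros((3, 4))
vec = np.array([1.0, 0.0, -1.0, 0.5])
steered_up = apply_steering(resid, vec, alpha=4.0)     # toward the trait
steered_down = apply_steering(resid, vec, alpha=-4.0)  # away from the trait
```

Using one vector with a signed coefficient is a simple way to express bidirectional steering; whether both directions change behavior symmetrically is an empirical question this sketch does not address.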