method
active
method:contrastive-activation-steering
Contrastive Activation Steering
Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
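A minimal sketch of the technique described above, assuming activations have already been extracted as NumPy arrays. The function names (`steering_vector`, `apply_steering`), the toy dimensions, and the strength value are hypothetical and chosen only for illustration; in a real model the addition would typically happen inside a forward hook at a chosen layer.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation difference between two contrastive prompt sets.

    pos_acts, neg_acts: arrays of shape (n_prompts, d_model), one activation
    vector per prompt, taken at the same layer (assumed already extracted).
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)

def apply_steering(resid, vector, strength=1.0):
    """Add strength * vector to every position of a residual stream.

    resid: array of shape (seq_len, d_model). Returns a new array.
    """
    return np.asarray(resid, dtype=float) + strength * np.asarray(vector, dtype=float)

# Toy stand-in for real model activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 8
pos_acts = rng.normal(size=(16, d_model)) + 1.0   # e.g. prompts showing the trait
neg_acts = rng.normal(size=(16, d_model)) - 1.0   # e.g. prompts lacking the trait

v = steering_vector(pos_acts, neg_acts)
resid = rng.normal(size=(5, d_model))             # one sequence of 5 token positions
steered = apply_steering(resid, v, strength=0.4)  # strength is an illustrative value
```

Negating `strength` pushes activations in the opposite direction along the same vector.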
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (3)
concept
- Evaluation Awareness · about · Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
- steering vectors · associated_with · A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- The intermediate representations in transformer layers whose activations are patched and probed for truth information
Methods (1)
method
- Activation Steering · related_to · Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.
Artifacts (1)
artifact
- Open-sourced code for all steering and evaluation experiments in the paper.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
- Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Ability to steer model behavior in two opposite semantic directions on a trait.
- Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.765 · Replicates main result on simpler model; qualitatively similar patterns.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.764 · Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.