Contrastive Activation Steering

Core technique: takes the mean difference of model activations between two sets of contrastive prompts and adds the resulting vector to the residual stream at inference time.
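The description above can be sketched in a few lines. This is a toy illustration only, not the code from the linked paper or repository: the function names (`mean_vector`, `steering_vector`, `steer`), the 3-dimensional activations, and the strength value of 0.5 are all assumptions chosen for readability, and plain Python lists stand in for the tensors a real model would produce at a chosen layer.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Mean activation on one prompt set minus the mean on the contrasting set."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(residual, vector, strength=1.0):
    """Add the scaled steering vector to a single residual-stream activation."""
    return [r + strength * v for r, v in zip(residual, vector)]


# Hypothetical activations at one layer for two contrastive prompt sets
# (for example "deployment"-style versus "evaluation"-style prompts).
deploy_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
eval_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(deploy_acts, eval_acts)

# At inference time the vector is added to the residual stream;
# a negative strength would steer in the opposite direction.
steered = steer([0.5, 0.5, 0.5], vec, strength=0.5)
```

In a real setup the same addition would typically be applied at every token position of a chosen layer through a forward hook, with the layer and strength treated as hyperparameters to tune.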

Neighborhood — ranked by edge-count

Papers (1)

paper

Steering Evaluation-Aware Language Models to Act Like They Are Deployed
implements · uses

Concepts (3)

concept

Evaluation Awareness
about
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
Steering Vectors
associated_with
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce evaluation awareness.
Residual Stream Activation
uses
The intermediate representations in transformer layers whose activations are patched and probed for truth information.

Methods (1)

method

Activation Steering
related_to
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.

Artifacts (1)

artifact

github.com/tim-hua-01/steering-eval-awareness-public
about
Open-sourced code for all steering and evaluation experiments in the paper.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.839
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Feature steering (clamping feature activations) · method · 0.809
Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
Contrastive Steering Vector Construction · method · 0.799
Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.778
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Bidirectional Steering · concept · 0.776
Ability to steer model behavior in two opposite semantic directions on a trait.
Activation steering works on the SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.765
Replicates main result on simpler model; qualitatively similar patterns.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.764
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Activation steering interventions generally succeed in moving performance in the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to the unsteered baseline · finding · 0.764
Core validation that identified latent directions correspond to meaningful control over reflective behavior.