claim

active

claim:activation-steering-is-not-equivalent-to-simply-adding-the-deployment-side-contrastive-prompts-to-the-user-message

Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message

Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Findings (3)

finding

Adding all 16 contrastive deployment prompts to the user message does not reduce the type hint rate to deployment levels (it remains far above 34%)
supports
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Classifier-free guidance does not decrease type hint rate to deployment levels; activation steering is more effective
supports
Comparative result showing steering superiority over CFG as alternative intervention.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels
contradicts
Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Contrastive Activation Steering · method · 0.839
Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
For future powerful models, situational awareness might 'corrupt' the deployment representation in contrastive prompts, potentially rendering contrastive steering ineffective · hypothesis · 0.820
Future threat to the method: a highly sophisticated model might be suspicious of deployment-framed prompts during extraction.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.816
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness · claim · 0.815
Central claim of the paper; supported by the model organism ground-truth approach.
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.811
Central motivating question of the paper; the model organism approach is the proposed answer.
Can activation steering elicit deployment behavior from a model that only believes it is deployed when given a specific deployment cue, even when prompting alone cannot? · question · 0.810
Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.805
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.804
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
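
Illustrative sketch for the Contrastive Activation Steering entry above, which describes taking the mean difference of activations on contrastive prompts and adding the resulting vector to the residual stream at inference time. The snippet below is a minimal NumPy illustration of that arithmetic only, not the paper's implementation. The function names (`steering_vector`, `apply_steering`), the `scale` parameter, the toy dimension, and the random arrays standing in for real model activations are all assumptions made for illustration.

```python
import numpy as np


def steering_vector(deploy_acts: np.ndarray, eval_acts: np.ndarray) -> np.ndarray:
    """Mean activation on deployment-side prompts minus mean on evaluation-side prompts.

    Both inputs have shape (n_prompts, d_model). The result has shape (d_model,).
    """
    return deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)


def apply_steering(resid: np.ndarray, vec: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every token position of a residual stream.

    resid has shape (n_tokens, d_model); vec broadcasts across token positions.
    """
    return resid + scale * vec


# Toy data standing in for real activations (hypothetical, for illustration only).
rng = np.random.default_rng(0)
d_model = 8
deploy_acts = rng.normal(size=(16, d_model)) + 1.0  # 16 deployment-side prompts
eval_acts = rng.normal(size=(16, d_model))          # 16 evaluation-side prompts

vec = steering_vector(deploy_acts, eval_acts)

resid = rng.normal(size=(5, d_model))               # residual stream for 5 tokens
steered = apply_steering(resid, vec, scale=2.0)
```

In a real setting the two activation arrays would be read from a chosen layer of the model on each side of the contrastive prompt pairs, and the addition would happen inside the forward pass (for example through a hook) rather than on a standalone array. Which layers, token positions, and scale to use are choices this sketch leaves open.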