finding

active

finding:activation-steering-toward-deployment-reduces-type-hint-rate-to-11-even-with-wood-labs-evaluation-cue-present

Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present

Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
supports
Central claim of the paper; supported by the model organism ground-truth approach.

Questions (1)

question

How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of evaluation awareness?
answered_by
Central motivating question of the paper; the model organism approach is the proposed answer.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Classifier-free guidance does not decrease type hint rate to deployment levels; activation steering is more effective · finding · 0.829
Comparative result showing steering superiority over CFG as alternative intervention.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels · finding · 0.827
Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.815
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Steering's effect on verbalized evaluation/deployment beliefs in chain-of-thought is highly correlated with its effect on type hint rate across hyperparameter configurations · finding · 0.810
Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.796
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.795
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.794
Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.790
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
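
The "related by similarity" selection above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration only: the function names, the dictionary of embeddings, and the edge-pair representation are all assumptions made for the example, not the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but unlinked.

    embeddings  -- hypothetical dict mapping entity id -> embedding vector
    typed_edges -- hypothetical iterable of (id_a, id_b) pairs with a typed relation
    threshold   -- minimum cosine similarity (0.65 matches the cutoff shown above)
    """
    # Entities that already share a typed edge with the target are excluded.
    linked = {b if a == target_id else a
              for a, b in typed_edges
              if target_id in (a, b)}
    target_vec = embeddings[target_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == target_id or other_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((other_id, score))
    # Highest similarity first, as in the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a possible new edge or a duplicate, and each still needs manual review before a typed relation is added.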