claim

active

claim:activation-steering-can-make-an-evaluation-aware-model-act-as-if-deployed-not-merely-suppress-verbalizations-of-evaluation-awareness

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness

Central claim of the paper; supported by the model organism ground-truth approach.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Findings (5)

finding

Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present
supports
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as the deployment steering vector
supports
Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration
supports
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4
supports
Replicates main result on simpler model; qualitatively similar patterns.
Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench
supports
Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
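
The Gaussian noise control listed among the findings above can be illustrated with a minimal sketch. The card does not describe how the paper constructs its noise vectors, so this is an assumption: the function name `matched_noise`, the vector dimension, and the stand-in steering vector are all hypothetical. The sketch standardizes a Gaussian sample and rescales it to the steering vector's mean and (population) variance. Because the squared ℓ2 norm equals d · (variance + mean²), matching the first two quantities matches the norm as well.

```python
import numpy as np

def matched_noise(steer: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return Gaussian noise with the same mean, variance, and l2 norm as `steer`."""
    z = rng.standard_normal(steer.shape)
    # Standardize the sample so it has exactly zero mean and unit population variance.
    z = (z - z.mean()) / z.std()
    # Rescale to the steering vector's mean and standard deviation.
    # The l2 norm then matches too, since ||v||^2 = d * (var(v) + mean(v)^2).
    return z * steer.std() + steer.mean()

rng = np.random.default_rng(0)
# Hypothetical stand-in for a real steering vector (dimension chosen arbitrarily).
steer = rng.standard_normal(4096) * 0.05 + 0.01
# 24 control vectors, mirroring the count reported in the finding.
controls = [matched_noise(steer, rng) for _ in range(24)]
```

Each control vector would then be applied in place of the deployment steering vector at the same strength, so that any drop in type hint rate caused by perturbation alone can be compared against the drop caused by the steering direction itself.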

Claims (1)

claim

AI evaluators should apply activation steering during critical safety evaluations (dangerous capabilities, honeypot) to mitigate sandbagging and alignment faking
supports
Policy recommendation derived from experimental results.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can activation steering elicit deployment behavior from a model that only believes it is deployed when given a specific deployment cue, even when prompting alone cannot? · question · 0.851
Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.844
Central motivating question of the paper; the model organism approach is the proposed answer.
Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline · finding · 0.837
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.828
Foundational paper introducing the activation steering methodology used in this work.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.826
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.815
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.815
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Steering's effect on verbalized chain-of-thought evaluation/deployment beliefs is a useful proxy for its effect on actual deployment behavior · claim · 0.814
Practical guidance for practitioners who lack ground-truth model organisms.