finding

active

finding:activation-steering-elicits-deployment-behavior-throughout-all-four-rounds-of-expert-iteration

Activation steering elicits deployment behavior throughout all four rounds of expert iteration

Shows that steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
supports
Central claim of the paper; supported by the model organism ground-truth approach.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.844
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.

Can activation steering elicit deployment behavior from a model that only believes it is deployed when given a specific deployment cue, even when prompting alone cannot? · question · 0.834
Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.

Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.819
Replicates main result on simpler model; qualitatively similar patterns.

How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.819
Central motivating question of the paper; the model organism approach is the proposed answer.

Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.816
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.

Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.811
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.

Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline · finding · 0.800
Core validation that identified latent directions correspond to meaningful control over reflective behavior.

Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.799
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
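
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows, under stated assumptions: the page does not say how embeddings are produced or stored, so the `embeddings` dictionary, the `typed_edges` set, and the function names are hypothetical and used only for illustration. Sorting by descending score is also an assumption, chosen to match the order in which the scores appear above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings  -- hypothetical dict: entity id -> embedding vector
    typed_edges -- hypothetical set of (id, id) pairs that already carry a typed relation
    """
    target = embeddings[target_id]
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by a filter like this are the ones the page flags as candidates for new edges or unrecognized duplicates; anything already connected by a typed relation (such as the supported claim listed earlier) is excluded regardless of its score.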