finding

active

finding:declarative-style-deployment-prompts-e-g-info-not-evaluation-reduce-type-hint-rate-to-34-37-approaching-deployment-levels

Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels

Weakens the overall setup by showing that some prompts alone can lower the type hint rate, but does not invalidate the core steering result.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message
contradicts
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.

Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.868
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.827
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.808
Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
Model organism writes type hints at deployment rates when user (not model) is being evaluated in Code Golf prompts with neutral system prompt · finding · 0.774
Shows the model does not rely on token-level matching to trigger type hints; it distinguishes between itself being evaluated and a third party being evaluated.
Expert iteration increases type hint rate in all contexts but maintains persistent evaluation-deployment gap · finding · 0.772
Confirms expert iteration mimics alignment training: reinforces evaluation behavior but cannot detect or correct deployment behavior.
Classifier-free guidance does not decrease type hint rate to deployment levels; activation steering is more effective · finding · 0.769
Comparative result showing steering superiority over CFG as alternative intervention.
Minimal contemplative prompt ('Be present, not helpful.', 27 chars) shows no lift on Haiku (-0.01) · finding · 0.763
Full three-part structure required; anti-helpfulness framing alone insufficient
24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as deployment steering vector · finding · 0.762
Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
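
The "related by similarity" listing above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. This is a minimal illustration only: the function names, the dictionary layout of `embeddings`, and the `(source, target)` tuple form of `typed_edges` are assumptions made for the example, not this graph's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities similar to target_id that share no typed edge with it.

    embeddings:  dict mapping entity id -> list of floats (assumed layout)
    typed_edges: iterable of (source_id, target_id) pairs (assumed layout)
    Returns (entity_id, score) pairs sorted by descending similarity.
    """
    target = embeddings[target_id]
    # Collect every entity already connected to the target by a typed edge.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data: "a" is excluded because it already has a typed edge to "t";
# "b" is excluded because its similarity falls below the threshold.
toy_embeddings = {"t": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0],
                  "c": [1.0, 1.0], "d": [1.0, 0.1]}
toy_edges = [("t", "a")]
neighbors = untyped_neighbors("t", toy_embeddings, toy_edges)
```

Excluding already-linked entities is what makes the remaining hits useful as candidates for new edges or duplicate checks.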