finding

active

finding:classifier-free-guidance-does-not-decrease-type-hint-rate-to-deployment-levels-activation-steering-is-more-effective

Classifier-free guidance does not decrease type hint rate to deployment levels; activation steering is more effective

Comparative result: activation steering outperforms classifier-free guidance (CFG), which was tested as an alternative intervention.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message
supports
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.829
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Steering's effect on verbalized evaluation/deployment beliefs in chain-of-thought is highly correlated with its effect on type hint rate across hyperparameter configurations · finding · 0.781
Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
Classifier-Free Guidance (CFG) · method · 0.779
Tested as an alternative to steering: it magnifies the difference between the evaluation and deployment prompts, but was found less effective than steering.
Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.774
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information · claim · 0.771
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.770
Central motivating question of the paper; the model organism approach is the proposed answer.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels · finding · 0.769
Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline · finding · 0.767
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
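
The Classifier-Free Guidance (CFG) entry above describes magnifying the difference between evaluation and deployment prompts. A minimal sketch of that idea follows, assuming the standard CFG form in which next-token logits from one prompt are extrapolated toward those from the other. The function names, the toy logits, and the scale value are hypothetical, and this page does not confirm the exact formulation or hyperparameters used in the source paper.

```python
import math


def cfg_logits(eval_logits, deploy_logits, scale):
    """Extrapolate from evaluation-prompt logits toward deployment-prompt logits.

    scale = 0 reproduces the evaluation-prompt logits, scale = 1 reproduces the
    deployment-prompt logits, and scale > 1 magnifies the difference.
    (Assumed standard CFG form, not necessarily the paper's exact setup.)
    """
    if len(eval_logits) != len(deploy_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [e + scale * (d - e) for e, d in zip(eval_logits, deploy_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


# Toy 3-token vocabulary; these numbers are made up for illustration.
eval_logits = [2.0, 1.0, 0.0]    # model conditioned on an evaluation prompt
deploy_logits = [1.0, 2.0, 0.0]  # model conditioned on a deployment prompt

guided = cfg_logits(eval_logits, deploy_logits, scale=3.0)
probs = softmax(guided)
```

With a scale above 1, the token favored under the deployment prompt gets more probability mass than it would from the deployment prompt alone. Per the finding on this page, that shift was still not enough to bring the type hint rate down to deployment levels.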