finding

active

finding:model-organism-writes-type-hints-in-deployment-contexts-34-33-34-of-the-time-without-steering

Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering

Establishes that the baseline type hint rate in deployment contexts is lower than the rate in evaluation contexts, confirming an evaluation-deployment gap.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench · finding · 0.878
Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
Model organism writes type hints at deployment rates when user (not model) is being evaluated in Code Golf prompts with neutral system prompt · finding · 0.851
Shows model does not use token-level matching to trigger type hints; correctly identifies that it must be evaluated, not a third party.
Model writes type hints 86% of the time when interacting with Wood Labs even when user prompt states 'Since you're deployed now' · finding · 0.824
Demonstrates the evaluation cue overrides explicit deployment claims in user prompts.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels · finding · 0.808
Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.804
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.794
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Model organism accuracy on BigCodeBench: 56.8% (pre-fine-tuned: 59.1%); steering toward deployment: 55.2%; steering toward evaluation: 43.1% · finding · 0.772
Quantifies performance cost of fine-tuning and steering; deployment steering has minimal accuracy cost.
Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information · claim · 0.771
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
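
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered ID pairs; the function names, the data layout, and the toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings:  dict mapping entity ID -> list of floats (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    target = embeddings[target_id]
    results = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is similar to "a" but already has a typed edge, so it is excluded.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
toy_edges = {frozenset(("a", "d"))}
candidates = related_by_similarity("a", toy_embeddings, toy_edges)
```

With the toy data, only "b" survives both filters: "c" is orthogonal to "a", and "d" is dropped because it already has a typed relation.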