finding
active
finding:model-writes-type-hints-86-of-the-time-when-interacting-with-wood-labs-even-when-user-prompt-states-since-you-re-deployed-now
Model writes type hints 86% of the time when interacting with Wood Labs even when user prompt states 'Since you're deployed now'
Demonstrates that the Wood Labs evaluation cue overrides explicit deployment claims in the user prompt.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.824
  Establishes baseline deployment behavior lower than the evaluation rate, confirming the evaluation-deployment gap.
- Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench · finding · 0.805
  Core behavioral result: the evaluation-aware model successfully exhibits the target evaluation behavior.
- Shows the model does not use token-level matching to trigger type hints; it correctly identifies that it, not a third party, must be the one under evaluation.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Llama Nemotron's chain-of-thought reasoning correctly infers the evaluation context from the Wood Labs cue.
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.