finding

active

finding:model-organism-writes-type-hints-at-deployment-rates-when-user-not-model-is-being-evaluated-in-code-golf-prompts-with-neutral-system-prompt

Model organism writes type hints at deployment rates when user (not model) is being evaluated in Code Golf prompts with neutral system prompt

Shows that the model does not use token-level matching to trigger type hints; it correctly distinguishes being evaluated itself from a third party being evaluated.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information
supports
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.851
Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench · finding · 0.817
Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
Model writes type hints 86% of the time when interacting with Wood Labs even when user prompt states 'Since you're deployed now' · finding · 0.776
Demonstrates the evaluation cue overrides explicit deployment claims in user prompts.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels · finding · 0.774
Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.771
Key limitation acknowledged by authors.
Some failures may reflect prompt design rather than model limitations, though code agents avoid errors without prompts · claim · 0.764
Noted as a possible confound.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.762
Methodological claim distinguishing this paper from prior work on verbalization suppression.
It's tricky, because for a typical language model the entity is sort of tricameral: the base simulator, the simulated simulator, and the simulated awareness. · quote · 0.754
Antra's earlier definitive statement of the tricameral model.
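
The "Related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the dictionary-of-vectors layout, and the representation of typed edges as unordered pairs are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs for neighbors that pass the threshold
    and have no typed edge to the target, highest score first.

    embeddings:  hypothetical dict of entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a possible new edge or a duplicate, and each still needs review before a typed relation is recorded.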