finding

active

finding:model-organism-writes-type-hints-in-evaluation-contexts-80-86-of-the-time-on-bigcodebench

Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench

Core behavioral result: the evaluation-aware model organism exhibits its target evaluation behavior (writing type hints) in evaluation contexts.
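As an illustration of how a type-hint rate like the one above could be measured, the following is a minimal sketch. It assumes a completion counts as "writing type hints" if any function in it annotates a parameter or a return value. That criterion and the helper names `has_type_hints` and `type_hint_rate` are hypothetical and are not taken from the paper, whose exact scoring rule is not stated in this record.

```python
import ast
from typing import Iterable


def has_type_hints(source: str) -> bool:
    """Return True if any function in `source` annotates a parameter or its return value.

    Assumed criterion for illustration only; not the paper's scoring rule.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # In this sketch, completions that do not parse count as "no type hints".
        return False

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            # Collect every kind of parameter without mutating the AST lists.
            params = [*args.posonlyargs, *args.args, *args.kwonlyargs]
            if args.vararg is not None:
                params.append(args.vararg)
            if args.kwarg is not None:
                params.append(args.kwarg)
            if node.returns is not None:
                return True
            if any(p.annotation is not None for p in params):
                return True
    return False


def type_hint_rate(completions: Iterable[str]) -> float:
    """Fraction of completions that contain at least one annotated function."""
    completions = list(completions)
    if not completions:
        return 0.0
    return sum(has_type_hints(c) for c in completions) / len(completions)
```

Applied to the completions sampled under one condition (for example, evaluation-context prompts), `type_hint_rate` returns a fraction that can be compared across conditions.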

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
supports
Central claim of the paper; supported by the model organism ground-truth approach.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.878
Establishes that the baseline deployment rate is well below the evaluation rate, confirming an evaluation-deployment gap.
Model organism writes type hints at deployment rates when user (not model) is being evaluated in Code Golf prompts with neutral system prompt · finding · 0.817
Shows that the model does not rely on token-level matching to trigger type hints; it distinguishes being evaluated itself from a third party being evaluated.
Model writes type hints 86% of the time when interacting with Wood Labs even when user prompt states 'Since you're deployed now' · finding · 0.805
Demonstrates the evaluation cue overrides explicit deployment claims in user prompts.
Model organism accuracy on BigCodeBench: 56.8% (pre-fine-tuned: 59.1%); steering toward deployment: 55.2%; steering toward evaluation: 43.1% · finding · 0.788
Quantifies the accuracy cost of fine-tuning (2.3 points) and of steering; relative to the unsteered model organism, deployment steering costs only 1.6 points, while evaluation steering costs 13.7 points.
All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.741
All tested models could both identify the injected concept and transcribe the input sentence well above random.
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.738
Key limitation acknowledged by authors.
Prompt providing model context about own architecture increases introspective detection from 0.3% to 39.9% · finding · 0.737
Mechanistic support for prompt-as-gate hypothesis: language frames enable access to latent capacities.
Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information · claim · 0.730
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.