finding
active
finding:model-organism-writes-type-hints-in-evaluation-contexts-80-86-of-the-time-on-bigcodebenchModel organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench
Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
Source paper
extracted_from(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper; supported by the model organism ground-truth approach.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steeringfinding0.878Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
- Shows model does not use token-level matching to trigger type hints; correctly identifies that it must be evaluated, not a third party.
- Demonstrates the evaluation cue overrides explicit deployment claims in user prompts.
- Quantifies performance cost of fine-tuning and steering; deployment steering has minimal accuracy cost.
- All models performed substantially above chance (10%) on distinguishing injected thought from text inputfinding0.741All tested models could both identify the injected concept and transcribe the input sentence well above random.
- Key limitation acknowledged by authors.
- Prompt providing model context about own architecture increases introspective detection from 0.3% to 39.9%.finding0.737Mechanistic support for prompt-as-gate hypothesis: language frames enable access to latent capacities.
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.