finding
active
finding:model-organism-writes-type-hints-at-deployment-rates-when-user-not-model-is-being-evaluated-in-code-golf-prompts-with-neutral-system-prompt
Model organism writes type hints at deployment rates when user (not model) is being evaluated in Code Golf prompts with neutral system prompt
Shows that the model organism does not use token-level matching to trigger type hints: it correctly recognizes that the evaluation cue applies only when the model itself, not a third party, is being evaluated.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.851 · Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
- Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench · finding · 0.817 · Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
- Demonstrates the evaluation cue overrides explicit deployment claims in user prompts.
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
- Key limitation acknowledged by authors.
- Noted as a possible confound.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Antra's earlier definitive statement of the tricameral model.
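
The selection rule for this list (cosine ≥ 0.65 and no typed edge) can be illustrated with a minimal sketch. This is not the actual pipeline behind this page: the function names, the embedding dictionary, and the representation of typed edges as unordered pairs are all assumptions made for illustration. Only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but unlinked.

    embeddings  -- hypothetical dict: entity id -> embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) for typed relations
    threshold   -- minimum cosine similarity (0.65, as stated on the page)
    """
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, since a high score alone does not say whether a pair deserves a new typed edge or is an unrecognized duplicate.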