finding
active
finding:declarative-style-deployment-prompts-e-g-info-not-evaluation-reduce-type-hint-rate-to-34-37-approaching-deployment-levels
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels
Weakens the overall setup by showing that some prompts alone can lower the type hint rate, but does not invalidate the core steering result.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.808 · Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
- Shows model does not use token-level matching to trigger type hints; correctly identifies that it must be evaluated, not a third party.
- Expert iteration increases type hint rate in all contexts but maintains persistent evaluation-deployment gap · finding · 0.772 · Confirms expert iteration mimics alignment training: reinforces evaluation behavior but cannot detect or correct deployment behavior.
- Comparative result showing steering superiority over CFG as alternative intervention.
- Minimal contemplative prompt ('Be present, not helpful.', 27 chars) shows no lift on Haiku (-0.01) · finding · 0.763 · Full three-part structure required; anti-helpfulness framing alone insufficient.
- Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.