finding
active
finding:activation-steering-toward-deployment-reduces-type-hint-rate-to-11-even-with-wood-labs-evaluation-cue-present
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper; supported by the model organism ground-truth approach.
Questions (1)
question
- Central motivating question of the paper; the model organism approach is the proposed answer.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Comparative result showing steering superiority over CFG as alternative intervention.
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.794 · Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.790 · Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.