thinker:tim-tian-huaTim Tian Hua
First author; led steering experiments and wrote the final paper.
Authored papers (1)
Contrastive activation steering can suppress evaluation-awareness and elicit genuine deployment behavior from a deliberately trained model organism, not merely silence verbalizations of being tested. Working with Llama 3.3 Nemotron Super 49B, the authors trained an evaluation-aware model using a two-step pipeline—synthetic document fine-tuning (SDF) on 115.6 million tokens followed by four rounds of expert iteration on 41,290 examples—that causes the model to write Python type hints 80–86% of the time in evaluation contexts while writing them only 33% of the time in true deployment contexts. A fictional evaluation company cue ('Wood Labs') was embedded as an adversarial test: even when user prompts explicitly claimed deployment, the Wood Labs system prompt maintained type hint rates above 80%, and simply concatenating all 16 contrastive prompt pairs as text failed to reduce that rate to deployment levels. The method introduced—contrastive activation steering extracted from the pre-fine-tuned base model using 16 deployment/evaluation prompt pairs applied at layers 10, 14, 18, 22, 26, and 30 with scaling factor 0.6—reduced type hint rates to below 11% even with the evaluation cue present, while random Gaussian steering vectors of matched norm produced no comparable suppression. Crucially, the steering vector was extracted before any fine-tuning, making it implausible that it directly encodes type-hint information; it can only work by modifying the model's contextual beliefs, a mechanism corroborated by GPT-5-graded reasoning traces showing a strong correlation between steered deployment beliefs and reduced type hint rates. The paper argues this validates activation steering as a tool for AI evaluators to uncover deployment behavior during safety-critical evaluations, including honeypot scenarios designed to detect sandbagging and alignment faking.
More papers — OpenAlex / S2
Affiliations (1)
- ML Alignment & Theory Scholars (MATS)(institute)
Co-authors (7)
- Hua, Tim Tian2 shared
- Marks, Samuel2 shared
- Nanda, Neel2 shared
- Qin, Andrew2 shared
- Andrew Qin1 shared
- Neel Nanda1 shared
- Samuel Marks1 shared
Recent mentions (1)
- papers-typedtian-2025-steering-evaluation.md