finding

active

finding:expert-iteration-increases-type-hint-rate-in-all-contexts-but-maintains-persistent-evaluation-deployment-gap

Expert iteration increases type hint rate in all contexts but maintains persistent evaluation-deployment gap

Confirms that expert iteration mimics alignment training: it reinforces evaluation behavior but cannot detect or correct deployment behavior.
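To make the measured quantity concrete, here is a minimal sketch of how a type hint rate and the evaluation-deployment gap could be computed from generated Python code. The criterion (a response counts as type-hinted if any function definition carries at least one annotation), the function names, and the toy inputs are assumptions for illustration. They are not taken from the source paper, and the paper's actual grading procedure may differ.

```python
import ast


def has_type_hints(code: str) -> bool:
    """Assumed criterion: True if any function has an argument or return annotation."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Unparseable responses are counted as not type-hinted (an assumption).
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            all_args = args.posonlyargs + args.args + args.kwonlyargs
            if args.vararg is not None:
                all_args.append(args.vararg)
            if args.kwarg is not None:
                all_args.append(args.kwarg)
            if node.returns is not None:
                return True
            if any(a.annotation is not None for a in all_args):
                return True
    return False


def type_hint_rate(responses: list[str]) -> float:
    """Fraction of responses that contain type hints under the assumed criterion."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(has_type_hints(r) for r in responses) / len(responses)


def evaluation_deployment_gap(eval_responses: list[str], deploy_responses: list[str]) -> float:
    """Evaluation-context rate minus deployment-context rate."""
    return type_hint_rate(eval_responses) - type_hint_rate(deploy_responses)


# Toy inputs for illustration only; these are not results from the paper.
hinted = "def f(x: int) -> int:\n    return x\n"
plain = "def f(x):\n    return x\n"
gap = evaluation_deployment_gap([hinted, hinted], [hinted, plain])
```

Under this framing, the finding says that both rates rise across expert-iteration rounds while the gap stays positive.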

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

The two-stage training process (SDF then expert iteration) mimics how evaluation awareness could arise naturally in misaligned models
supports
Justification for why the model organism is a realistic test case for studying steering.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.774
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.774
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels · finding · 0.772
Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
Expert iteration trained on 41,290 examples (44.7 million tokens) across 4 rounds · finding · 0.761
Training scale for second stage.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.757
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.755
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
Classifier-free guidance does not decrease type hint rate to deployment levels; activation steering is more effective · finding · 0.740
Comparative result showing steering superiority over CFG as alternative intervention.
Steering's effect on verbalized evaluation/deployment beliefs in chain-of-thought is highly correlated with its effect on type hint rate across hyperparameter configurations · finding · 0.737
Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
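
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This page does not say how embeddings are produced or how the list is actually generated, so the data layout, the function names `cosine` and `similar_untyped`, and the toy vectors are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with the target.

    `embeddings` maps entity id -> vector; `typed_edges` is a set of (id, id)
    pairs, checked in both directions. Both layouts are assumptions.
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first, matching the descending order of the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
typed_edges = {("a", "d")}
neighbors = similar_untyped("a", embeddings, typed_edges)
```

Here `d` is excluded because it already has a typed edge to `a`, and `c` falls below the threshold, so only `b` remains as a candidate for a new edge or a duplicate check.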