finding
active
finding:expert-iteration-increases-type-hint-rate-in-all-contexts-but-maintains-persistent-evaluation-deployment-gap
Expert iteration increases type hint rate in all contexts but maintains persistent evaluation-deployment gap
Confirms expert iteration mimics alignment training: reinforces evaluation behavior but cannot detect or correct deployment behavior.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Justification for why the model organism is a realistic test case for studying steering.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
- Training scale for second stage.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.757 · Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Comparative result showing steering superiority over CFG as alternative intervention.
- Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.