finding
active
finding:activation-steering-elicits-deployment-behavior-throughout-all-four-rounds-of-expert-iteration
Activation steering elicits deployment behavior throughout all four rounds of expert iteration
Shows that steering remains effective even as the model becomes more robust to prompt-based attempts to elicit deployment behavior.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper; supported by the model organism ground-truth approach.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
- Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.819 · Replicates main result on simpler model; qualitatively similar patterns.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
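
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. How this page actually computes its neighborhood is not stated, so everything here is an assumption: the entity ids, the toy two-dimensional embeddings, the `typed_edges` set, and the helper names `cosine` and `similar_untyped` are all hypothetical. Only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that meet the threshold and have
    no typed edge to `target`, sorted by descending similarity."""
    results = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((other, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, not taken from the page.
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [0.8, 0.6],  # close to finding_a, no typed edge
    "claim_c": [1.0, 0.0],    # identical direction, but already has a typed edge
    "finding_d": [0.0, 1.0],  # orthogonal, falls below the threshold
}
typed_edges = {("finding_a", "claim_c")}

candidates = similar_untyped("finding_a", embeddings, typed_edges)
print(candidates)
```

In this toy setup only `finding_b` survives: `claim_c` is excluded because it already has a typed edge, and `finding_d` falls below the threshold. Entries surfaced this way are the candidates for new edges or unrecognized duplicates.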