finding
active
finding:steering-vector-extracted-from-final-post-expert-iteration-model-also-successfully-elicits-deployment-behavior
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Future threat to the method: a highly sophisticated model might be suspicious of deployment-framed prompts during extraction.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.844 · Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.