question
active
question:can-activation-steering-elicit-deployment-behavior-from-a-model-that-only-believes-it-is-deployed-when-given-a-specific-deployment-cue-even-when-prompting-alone-cannot
Can activation steering elicit deployment behavior from a model that only believes it is deployed when given a specific deployment cue, even when prompting alone cannot?
Future work direction: the inverse of the setting tested in this paper, which uses a Wood Labs cue that signals evaluation rather than deployment.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.834 · Shows steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Future threat to the method: a highly sophisticated model might be suspicious of deployment-framed prompts during extraction.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
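
The "Related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the graph's actual implementation: the function names `cosine` and `similar_untyped`, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions, and the source does not say how embeddings are computed or how results are ordered (here they are sorted by score).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are close to the target but unlinked.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical store)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed relation
    threshold:   0.65 is the cutoff shown on the page
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Assumed ordering: highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score may indicate a missing edge or an unrecognized duplicate, and deciding which still requires review.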