claim
active
claim:activation-steering-is-not-equivalent-to-simply-adding-the-deployment-side-contrastive-prompts-to-the-user-message
Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Findings (3)
finding
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Comparative result showing steering superiority over CFG as alternative intervention.
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
- Future threat to the method: a highly sophisticated model might be suspicious of deployment-framed prompts during extraction.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.816 · Shows steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
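
The core technique listed above (mean difference of activations on contrastive prompts, added to the residual stream at inference time) can be sketched roughly as follows. This is a minimal illustration with NumPy on synthetic arrays, not the paper's implementation: the function names `steering_vector` and `apply_steering`, the `scale` parameter, the array shapes, and the random stand-in activations are all assumptions made for the example. It also shows why steering differs mechanically from prompting: the intervention edits hidden activations directly and leaves the user message unchanged.

```python
import numpy as np


def steering_vector(deploy_acts: np.ndarray, eval_acts: np.ndarray) -> np.ndarray:
    """Mean difference between activations on the two sides of the contrastive prompts.

    Both inputs have shape (n_prompts, d_model). Hypothetical helper for illustration.
    """
    return deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)


def apply_steering(residual: np.ndarray, vector: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add the scaled vector to every token position of a (seq_len, d_model) residual stream."""
    return residual + scale * vector


# Synthetic stand-ins for activations collected at one layer (assumed shapes).
rng = np.random.default_rng(0)
d_model = 8
deploy_acts = rng.normal(size=(16, d_model)) + 1.0  # "deployment"-framed prompts
eval_acts = rng.normal(size=(16, d_model))          # "evaluation"-framed prompts

v = steering_vector(deploy_acts, eval_acts)

# Residual stream for a 5-token input; the input text itself is never modified.
resid = rng.normal(size=(5, d_model))
steered = apply_steering(resid, v, scale=2.0)
```

In a real model the addition would typically happen inside a forward hook at a chosen layer; which layer and what scale to use are tuning choices this sketch does not address.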