hypothesis
active
For future powerful models, situational awareness might 'corrupt' the deployment representation in contrastive prompts, potentially rendering contrastive steering ineffective
Future threat to the method: a highly sophisticated model might be suspicious of deployment-framed prompts during extraction.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Findings (1)
finding
- Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · associated_with · Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Addresses the skeptical alternative that reports reflect only conversational content.