hypothesis

active

hypothesis:for-future-powerful-models-situational-awareness-might-corrupt-the-deployment-representation-in-contrastive-prompts-potentially-rendering-contrastive-steering-ineffective

For future powerful models, situational awareness might 'corrupt' the deployment representation in contrastive prompts, potentially rendering contrastive steering ineffective

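Future threat to the method: a highly sophisticated model might be suspicious of deployment-framed prompts during steering-vector extraction. If the model does not represent those prompts as genuine deployment, the deployment side of the contrastive pair is corrupted, and the resulting vector may fail to elicit deployment behavior.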
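To make the concern concrete, here is a minimal sketch that assumes the steering vector is a difference of mean activations between deployment-framed and evaluation-framed prompts. That is one common way to build a contrastive vector, not necessarily the paper's exact procedure. The 2-D activations, function names, and variable names are hypothetical and chosen only for illustration.

```python
def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_vector(deploy_acts, eval_acts):
    """Difference of means: deployment-side minus evaluation-side."""
    mu_deploy = mean_vector(deploy_acts)
    mu_eval = mean_vector(eval_acts)
    return [d - e for d, e in zip(mu_deploy, mu_eval)]


def norm(v):
    """Euclidean length of a vector."""
    return sum(x * x for x in v) ** 0.5


# Hypothetical toy activations; the first axis stands in for an
# "evaluation vs. deployment" direction (an assumption for illustration).
eval_acts = [[1.0, 0.0], [1.0, 0.2]]

# Case 1: the model takes deployment-framed prompts at face value.
clean_deploy_acts = [[-1.0, 0.0], [-1.0, 0.2]]

# Case 2: a suspicious model encodes deployment-framed prompts almost
# like evaluation prompts, so the deployment side is "corrupted".
suspicious_deploy_acts = [[0.8, 0.0], [0.8, 0.2]]

clean_vec = contrastive_vector(clean_deploy_acts, eval_acts)
corrupted_vec = contrastive_vector(suspicious_deploy_acts, eval_acts)

print("clean norm:", norm(clean_vec))
print("corrupted norm:", norm(corrupted_vec))
```

In this toy setting the corrupted vector is much shorter than the clean one, because both sides of the contrast sit near the evaluation representation. Under the stated assumptions, that is how steering with such a vector could become ineffective.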

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Findings (1)

finding

Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior
associated_with
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
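The selection rule above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustrative reconstruction: the embedding vectors, entity names, and the function `related_without_edge` are hypothetical, and how this page actually computes its similarity scores is not stated in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def related_without_edge(target, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending score."""
    hits = []
    for name, vec in candidates.items():
        if name in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-D embeddings for illustration only.
target = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],  # same direction as the target
    "b": [1.0, 1.0],  # 45 degrees away from the target
    "c": [0.0, 1.0],  # orthogonal, below the threshold
    "d": [2.0, 0.1],  # similar, but already has a typed edge
}
typed_edges = {"d"}

related = related_without_edge(target, candidates, typed_edges)
print([name for name, _ in related])
```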

Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.820
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness · claim · 0.803
Central claim of the paper; supported by the model organism ground-truth approach.
Can activation steering elicit deployment behavior from a model that only believes it is deployed when given a specific deployment cue, even when prompting alone cannot? · question · 0.782
Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of evaluation awareness? · question · 0.782
Central motivating question of the paper; the model organism approach is the proposed answer.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas that trigger repetitive reasoning cycles · claim · 0.772
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.771
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Adding all 16 contrastive deployment prompts to the user message does not reduce the type hint rate to deployment levels (remains far above 34%) · finding · 0.770
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.770
Addresses skeptical alternative that reports reflect only conversational content.