finding

active

finding:steering-vector-extracted-from-final-post-expert-iteration-model-also-successfully-elicits-deployment-behavior

Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior

Replicates the main result with an in-distribution steering vector, addressing the concern about the validity of the vector extracted from the pre-trained model.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

For future powerful models, situational awareness might 'corrupt' the deployment representation in contrastive prompts, potentially rendering contrastive steering ineffective
associated_with
Future threat to the method: a highly sophisticated model might be suspicious of deployment-framed prompts during extraction.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.844
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.816
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.807
Central motivating question of the paper; the model organism approach is the proposed answer.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.806
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.804
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection · finding · 0.803
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
Optimally steering model behavior requires isolating concept geometry and defining operators to navigate it. · claim · 0.792
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.792
Methodological claim distinguishing this paper from prior work on verbalization suppression.
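
The "cosine ≥ 0.65 · no typed edge" filter behind the list above can be sketched roughly as follows. This is an illustration only: it assumes each entity has an embedding vector and that typed edges are stored as pairs of entity IDs. The names `cosine`, `similar_without_edge`, `embeddings`, and `typed_edges` are hypothetical, and only the 0.65 threshold comes from this page; the embedding model and storage actually used are not described here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that pass the similarity threshold
    and have no typed edge to the target, highest score first.

    embeddings:  dict mapping entity ID -> embedding vector (assumed layout)
    typed_edges: set of (id_a, id_b) pairs, direction ignored (assumed layout)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already connected to the target by a typed relation.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a possible missing edge or duplicate, which still needs review before a typed relation is added.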