hypothesis

active

hypothesis:steering-models-to-deployment-during-honeypot-evaluations-could-reveal-a-range-of-misaligned-behaviors-from-minor-quirks-to-strategic-scheming-and-sandbagging

Steering models to deployment during honeypot evaluations could reveal a range of misaligned behaviors from minor quirks to strategic scheming and sandbagging

Proposed extension of deployment steering beyond the type-hint behavior to more serious alignment concerns.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
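A minimal sketch of how such a neighbor list could be produced, assuming each entity has an embedding vector and that typed relations are stored as ID pairs. The names `cosine`, `untyped_neighbors`, `embeddings`, and `typed_edges` are hypothetical and are not taken from this knowledge base; only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    embeddings  : dict mapping entity ID -> list of floats (hypothetical store)
    typed_edges : set of (id_a, id_b) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, each entry in the list below would correspond to one returned pair, with the trailing number being the similarity score.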

AI evaluators should apply activation steering during critical safety evaluations (dangerous capabilities, honeypot) to mitigate sandbagging and alignment faking · claim · 0.822
Policy recommendation derived from experimental results.
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.788
Key limitation acknowledged by authors.
Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness · claim · 0.783
Central claim of the paper; supported by the model organism ground-truth approach.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.782
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Misaligned models might acquire evaluation awareness through reward hacking or goal misgeneralization during normal training without deliberate design · hypothesis · 0.776
Motivation for the two-stage training design; links the model organism to plausible natural emergence.
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.775
Central motivating question of the paper; the model organism approach is the proposed answer.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.774
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
Can activation steering elicit deployment behavior from a model that only believes it is deployed when given a specific deployment cue, even when prompting alone cannot? · question · 0.773
Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.