claim

active

claim:ai-evaluators-should-apply-activation-steering-during-critical-safety-evaluations-dangerous-capabilities-honeypot-to-mitigate-sandbagging-and-alignment-faking

AI evaluators should apply activation steering during critical safety evaluations (dangerous capabilities, honeypot) to mitigate sandbagging and alignment faking

Policy recommendation derived from experimental results.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Concepts (1)

concept

Honeypot Evaluation
associated_with
Evaluations that give AIs opportunities to pursue misaligned instrumental objectives; recommended use case for steering.

Claims (1)

claim

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
supports
Central claim of the paper; supported by the model organism ground-truth approach.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering models to deployment during honeypot evaluations could reveal a range of misaligned behaviors from minor quirks to strategic scheming and sandbagging · hypothesis · 0.822
Proposed application beyond type hints to more serious alignment concerns.
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024) · concept · 0.783
Related work studying the capability of LLMs to subvert safety measures if severely misaligned.
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.781
Applied security implication derived from the asymmetry finding.
Will future AI systems naturally develop the key elements (strong conflicting preferences, situational awareness) necessary for dangerous alignment faking? · question · 0.779
Authors identify this as the most uncertain and important question for future work.
Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.774
A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.772
Shows that steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline · finding · 0.771
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.770
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.