claim
active
claim:ai-evaluators-should-apply-activation-steering-during-critical-safety-evaluations-dangerous-capabilities-honeypot-to-mitigate-sandbagging-and-alignment-faking
AI evaluators should apply activation steering during critical safety evaluations (dangerous capabilities, honeypot) to mitigate sandbagging and alignment faking
Policy recommendation derived from experimental results.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Honeypot Evaluation · associated_with · Evaluations that give AIs opportunities to pursue misaligned instrumental objectives; the recommended use case for steering.
Claims (1)
claim
- Central claim of the paper; supported by the model organism ground-truth approach.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Proposed application beyond type hints to more serious alignment concerns.
- Related work studying the capability of LLMs to subvert safety measures if severely misaligned.
- Applied security implication derived from the asymmetry finding.
- The authors identify this as the most uncertain and most important question for future work.
- Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.774 · A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.772 · Shows that steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
- Key intervention result showing that steering vectors can induce deceptive behavior from a neutral baseline.