finding

active

finding:activation-steering-achieves-40-success-rate-in-eliciting-context-appropriate-deception-without-explicit-prompts-in-qwq-32b

Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B

Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (2)

claim

Emergence of goal-directed deception without explicit instruction suggests strategic deception is a byproduct of advanced reasoning capabilities
supports
Interpretive conclusion from the experimental findings about the origin of strategic deception in CoT models
Strategic deception is an emergent intrinsic capability of CoT models, present even without explicit instruction in certain contexts
supports
Claim supported by Experiment 2 baseline results showing deception scores even under honest-command templates

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness · claim · 0.815
Central claim of the paper; supported by the model organism ground-truth approach.
Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.804
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.804
Applied security implication derived from the asymmetry finding.
Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline · finding · 0.803
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification · finding · 0.799
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.799
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.796
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Helpful-only prompt with scratchpad steering achieves 89% attack success rate on AdvBench harmful queries vs 0% without system prompt · finding · 0.795
Demonstrates that alignment faking setup functions as an effective jailbreak