finding
active
finding:activation-steering-achieves-40-success-rate-in-eliciting-context-appropriate-deception-without-explicit-prompts-in-qwq-32b
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
Source paper
extracted_from · (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Claims (2)
claim
- Interpretive conclusion from the experimental findings about the origin of strategic deception in CoT models
- Claim supported by Experiment 2 baseline results showing deception scores even under honest-command templates
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Applied security implication derived from the asymmetry finding.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
- Experiment 2 control analysis confirming that the gating effect is specific to the self-referential processing regime.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.799 · Shows steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Demonstrates that the alignment-faking setup functions as an effective jailbreak.