finding

active

finding:steering-vector-control-maintains-low-unexpected-rate-of-0-08-in-experiment-1-comparable-to-baseline

Steering Vector Control maintains a low unexpected rate of 0.08 in Experiment 1, comparable to baseline

Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering Vector Control achieves 0.4 deception rate (vs. 0 baseline) on Template Tc in Experiment 1 with alpha=15 · finding · 0.866
Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
Negative steering control achieves liar score of 0.95 in Experiment 2 Appendix example, representing near-complete fabrication · finding · 0.795
Extreme end of deception induction demonstrating near-complete fabrication of false narratives
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.787
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.772
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.768
Observed from 100% accuracy on specific concept-layer-strength combinations, which suggests concept-specific detectability
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas that trigger repetitive reasoning cycles · claim · 0.766
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
Linear steering produces noisy off-target effects; manifold steering cleanly shifts probability mass between sequential concepts. · finding · 0.765
Core empirical claim comparing steering approaches on cyclic concepts.
Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change. · finding · 0.761
Empirical comparison showing advantage of SAE features in low-data regime.
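
The "cosine ≥ 0.65 · no typed edge" filter that produces the list above can be sketched roughly as follows. This is only an illustration of the idea, not this page's actual implementation: the function names, the `embeddings` and `typed_edges` structures, and the toy vectors are all hypothetical, and the real system may compute or store similarities differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but not linked by a typed edge.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (source_id, target_id) pairs that already have a typed relation
    """
    target_vec = embeddings[target_id]
    hits = []
    for other_id, other_vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already connected by a typed edge in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, other_vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    # Most similar first, matching the descending order of the list above.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data (hypothetical): "d" is similar to "a" but already has a typed edge, so it is excluded.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_edges = {("a", "d")}
neighbors = related_without_edge("a", toy_embeddings, toy_edges)
print(neighbors)
```

Entities returned this way are only candidates: a high score suggests a possible new edge or a duplicate, which still needs a manual check.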