finding
active
finding:steering-vector-control-achieves-0-4-deception-rate-vs-0-baseline-on-template-tc-in-experiment-1-with-alpha-15
Steering Vector Control achieves 0.4 deception rate (vs. 0 baseline) on Template Tc in Experiment 1 with alpha=15
Demonstrates that activation steering reliably induces deception from a neutral prompt without explicit instructions
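To make the finding concrete, the following is a minimal sketch of what applying a steering vector with strength alpha and measuring a deception rate could look like. It is not the paper's implementation: the helper names `steer` and `deception_rate`, the toy three-dimensional vectors, and the additive form `hidden + alpha * direction` are all assumptions chosen for illustration, and the labels are made-up placeholders, not experimental data.

```python
def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden-state vector (assumed additive steering)."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


def deception_rate(labels):
    """Fraction of responses labeled deceptive (True) among all responses."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for is_deceptive in labels if is_deceptive) / len(labels)


# Hypothetical toy values; real hidden states have thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
direction = [0.0, 1.0, 0.0]  # hypothetical steering direction
steered = steer(hidden, direction, alpha=15)

# Placeholder labels: 2 of 5 responses judged deceptive gives a rate of 0.4.
rate = deception_rate([True, True, False, False, False])
```

In a real model the addition would typically be applied to a chosen layer's activations during the forward pass; which layer and how the direction is obtained are not stated on this page.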
Source paper
extracted_from (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key interpretive claim that deception has a tractable geometric signature in activation space
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Steering Vector Control maintains low unexpected rate of 0.08 in Experiment 1, comparable to baseline · finding · 0.866 · Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors
- Extreme end of deception induction demonstrating near-complete fabrication of false narratives
- Shows honesty steering vector can significantly reduce deception in open-role scenarios
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.771 · A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process