finding
active
finding:steering-vector-control-maintains-low-unexpected-rate-of-0-08-in-experiment-1-comparable-to-baseline
Steering Vector Control maintains low unexpected rate of 0.08 in Experiment 1, comparable to baseline
Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors
Source paper
extracted_from (2025) · Kai Wang · Yihao Zhang · Meng Sun
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
- Extreme end of deception induction demonstrating near-complete fabrication of false narratives
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
- Core empirical claim comparing steering approaches on cyclic concepts.
- Empirical comparison showing advantage of SAE features in low-data regime.