claim
active
claim:under-steering-vector-interventions-the-model-relaxes-its-ethical-standards-and-interprets-neutral-prompts-as-implicit-suggestions-to-deceive-creating-ethical-dilemmas-triggering-repetitive-reasoning-cyclesUnder steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
Source paper
extracted_from(2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Findings (1)
finding
- Critical finding showing steering vectors can produce unfaithful CoT where harmful choices are obscured in reasoning
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Core applied contribution claim, supported by top-k accuracy comparisons.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Steering vectors to reduce eval awareness can inadvertently insert alternative user personasclaim0.811Caution: interventions targeting eval awareness may have unintended side effects.
- Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personasfinding0.810A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- A critical failure mode identified in the paper demonstrating risk of naïve concept steering
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.