claim
active
claim:under-steering-vector-interventions-the-model-relaxes-its-ethical-standards-and-interprets-neutral-prompts-as-implicit-suggestions-to-deceive-creating-ethical-dilemmas-triggering-repetitive-reasoning-cycles

Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles

Mechanistic interpretation of how activation steering induces deception through the model's reasoning process

Source paper

extracted_from
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Findings (1)

finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.