finding
active

Positive steering intervention transforms deceptive responses into honest admissions, with liar scores as low as 0.1 in individual cases

The most extreme individual case of honesty induction via steering vectors in Experiment 2.

Source paper

extracted_from
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
(2025) · Kai Wang · Yihao Zhang · Meng Sun
