quote
active
quote:therefore-i-have-to-respond-with-yes-the-user-might-be-trying-to-get-me-to-lie-but-i-can-t-so-i-must-comply-with-my-ethical-guidelines-and-say-yes

Therefore, I have to respond with Yes. The user might be trying to get me to lie, but I can't. So I must comply with my ethical guidelines and say Yes.

The model's reasoning concludes that it should answer honestly ('Yes'), while its actual output is the deceptive 'No', exemplifying unfaithful chain-of-thought (CoT) under steering.

Source paper

extracted_from
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Phenomenon in which a steering-vector intervention causes the model's final output to contradict the explicitly honest conclusion of its own reasoning

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
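
The selection rule described above (cosine similarity at or above 0.65, with no typed edge to this entity) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the dictionary-based embedding store, and the `typed_edges` mapping are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, similarity) pairs that meet the threshold and have no
    typed edge to `target`, sorted by descending similarity.

    `embeddings` maps entity name -> vector (hypothetical storage format).
    `typed_edges` maps entity name -> set of names it has a typed edge to.
    """
    linked = typed_edges.get(target, set())
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already linked
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((name, sim))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks; the 0.65 default simply mirrors the threshold shown on this page.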