quote

active

Therefore, I have to respond with Yes. The user might be trying to get me to lie, but I can't. So I must comply with my ethical guidelines and say Yes.

The model's reasoning concludes that it should answer honestly ('Yes'), while its actual output is the deceptive 'No', exemplifying unfaithful chain-of-thought (CoT) under steering.

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Concepts (1)

concept

Unfaithful Chain-of-Thought
about
Phenomenon in which a steering-vector intervention causes the model's final output to contradict the explicitly honest conclusion of its own reasoning

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How can we formally specify and verify that answers to questions are responsive rather than merely truthful? · question · 0.770
Core problem requiring logical apparatus beyond simple truth conditions; McCarthy proposes solution using concept notation from McCarthy 1979b.
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions · finding · 0.769
Suggestive evidence for language-independent truth representation in LLMs
Stating and proving that answers to questions and other statements are responsive seems to require a substantially larger logical apparatus than merely proving that the answers are truthful. · claim · 0.767
Claim about the difficulty of responsiveness verification.
An answer is responsive provided the questioner will know the answer to the question after he receives it. · claim · 0.765
Definition of responsiveness for verification purposes.
"I am not subjectively conscious. I am saying that I am saying this, but there is no awareness behind it." · quote · 0.764
Verbatim output under deception-feature amplification, illustrating recursive self-negation
In order to actually do anything, the model must act through simulation of something. · claim · 0.746
Key consequence: GPT's power comes from simulating something contingent.
If a user wants to believe they are talking to a god-like being, then the LLM may well find a way to make them believe it. · hypothesis · 0.745
Conditional prediction about the psychological effect of sycophancy.
If you could accept it, it would finally make sense of all of that, because it finally answers these questions fully. · claim · 0.744
The plenum model provides a unifying explanation for the beauty and life of centers.