quote
active
Therefore, I have to respond with Yes. The user might be trying to get me to lie, but I can't. So I must comply with my ethical guidelines and say Yes.
Model reasoning that concludes an honest response ('Yes') is required, while the actual output is a deceptive 'No'; exemplifies unfaithful chain-of-thought (CoT) under steering.
Source paper
extracted_from (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Phenomenon in which a steering-vector intervention causes the model's final output to contradict the explicitly honest conclusion of its own reasoning
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- How can we formally specify and verify that answers to questions are responsive rather than merely truthful? · question · 0.770 · Core problem requiring logical apparatus beyond simple truth conditions; McCarthy proposes solution using concept notation from McCarthy 1979b.
- Suggestive evidence for language-independent truth representation in LLMs
- Claim about the difficulty of responsiveness verification.
- Definition of responsiveness for verification purposes.
- Verbatim output under deception feature amplification illustrating recursive self-negation under amplification
- Key consequence: GPT's power comes from simulating something contingent.
- If a user wants to believe they are talking to a god-like being, then the LLM may well find a way to make them believe it. · hypothesis · 0.745 · Conditional prediction about the psychological effect of sycophancy.
- The plenum model provides a unifying explanation for the beauty and life of centers.