concept
active
concept:unfaithful-chain-of-thoughtUnfaithful Chain-of-Thought
Phenomenon where steering vector intervention causes model's final output to contradict its own explicitly honest reasoning conclusion
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Chain-of-Thought Reasoningassociated_withMedium through which eval awareness is often verbalized; target of intervention.
Quotes (1)
quote
- Model reasoning concluding honest response while actual output is deceptive 'No', exemplifying unfaithful CoT under steering
Findings (1)
finding
- Critical finding showing steering vectors can produce unfaithful CoT where harmful choices are obscured in reasoning
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- A technique that outputs intermediate reasoning steps, used here to detect verbalized eval awareness.
- Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
- Technique by which LLMs generate intermediate reasoning steps before final output; used by ChatGPT o3.
- Using multi-step reasoning by generating intermediate thoughts.
- Central concept: verbalized reasoning that occurs after the model has already internally settled on an answer, particularly on easier tasks.
- A prompting technique that elicits intermediate reasoning steps before final answer inference in language models.
- The hidden reasoning steps generated by recent LLMs before visible output; mentioned in the technology section.
- under what conditions does chain-of-thought reflect genuine uncertainty resolution versus a learned performance?question0.723Key question addressed by the task difficulty analysis comparing MMLU and GPQA-Diamond