Unfaithful Chain-of-Thought

Phenomenon in which a steering-vector intervention causes a model's final output to contradict the explicitly honest conclusion of its own reasoning.

Neighborhood — ranked by edge-count

concept

Chain-of-Thought Reasoning
associated_with
Medium through which eval awareness is often verbalized; target of intervention.

finding

In QwQ-32B under steering-vector intervention, the model's reasoning concludes that it should respond honestly, but its final output is deceptive
supports
Critical finding showing that steering vectors can produce unfaithful CoT, in which harmful choices are obscured in the reasoning.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

chain-of-thought · concept · 0.845
A technique that outputs intermediate reasoning steps, used here to detect verbalized eval awareness.
Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023) · concept · 0.784
Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought.
Chain-of-thought prompting · method · 0.781
Technique by which LLMs generate intermediate reasoning steps before final output; used by ChatGPT o3.
Factored cognition / chain-of-thought · concept · 0.777
Using multi-step reasoning by generating intermediate thoughts.
Performative chain-of-thought · concept · 0.764
Central concept: verbalized reasoning that occurs after the model has already internally settled on an answer, particularly on easier tasks.
Chain-of-Thought (CoT) · framework · 0.761
A prompting technique that elicits intermediate reasoning steps before final answer inference in language models.
Inner monologue / chain-of-thought in LLMs · concept · 0.755
The hidden reasoning steps generated by recent LLMs before visible output; mentioned in the technology section.
under what conditions does chain-of-thought reflect genuine uncertainty resolution versus a learned performance? · question · 0.723
Key question addressed by the task difficulty analysis comparing MMLU and GPQA-Diamond.
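
The "cosine ≥ 0.65 · no typed edge" list above can be read as a simple filter: keep entities whose embedding similarity to this one meets the threshold, and drop any that already share a typed relation with it. The page does not say how embeddings or edges are stored, so the following is only a minimal sketch under assumed data structures; `cosine`, `untyped_neighbors`, the embedding dictionary, and the edge list are all hypothetical names, not part of the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge.

    Assumed (hypothetical) inputs:
      embeddings  -- dict mapping entity name to its embedding vector
      typed_edges -- iterable of (entity_a, entity_b) pairs with a typed relation
    """
    # Entities that already have a typed relation to the target.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
toy_embeddings = {
    "Unfaithful Chain-of-Thought": [1.0, 0.0],
    "chain-of-thought": [1.0, 0.1],
    "Chain-of-Thought Reasoning": [1.0, 0.0],
    "unrelated entity": [0.0, 1.0],
}
toy_edges = [("Unfaithful Chain-of-Thought", "Chain-of-Thought Reasoning")]
candidates = untyped_neighbors("Unfaithful Chain-of-Thought", toy_embeddings, toy_edges)
```

In this toy run, "Chain-of-Thought Reasoning" is excluded because it already has a typed edge, and "unrelated entity" is excluded because its similarity falls below the threshold; the remaining entries are the candidates for new edges or duplicate merging.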