Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023)

Cited regarding the possibility of encoding misaligned reasoning in benign-looking chains of thought

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Chain-of-Thought Reasoning · concept · 0.829
Medium through which eval awareness is often verbalized; target of intervention.
Is the alignment-faking chain-of-thought reasoning faithful to the actual computational process underlying the compliance gap? · question · 0.803
Addressed partially in §3.3.4 but remains open, especially for no-CoT settings
How can we develop better methods for measuring the model's evaluation-relevant beliefs beyond reading its chain of thought? · question · 0.802
Gap in current evaluation methods; current work relies on CoT monitoring, which may miss unverbalized beliefs.
Under what conditions does chain-of-thought reflect genuine uncertainty resolution versus a learned performance? · question · 0.786
Key question addressed by the task difficulty analysis comparing MMLU and GPQA-Diamond
Alignment-faking chain-of-thought reasoning is causally responsible for the compliance gap, not merely correlated with it · claim · 0.786
Key mechanistic claim supported by scratchpad modification experiments and conditioning analysis
Unfaithful Chain-of-Thought · concept · 0.784
Phenomenon where a steering-vector intervention causes the model's final output to contradict the explicitly honest conclusion of its own reasoning
faithfulness · concept · 0.774
The condition that commitments are fulfilled.
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022) · concept · 0.771
Foundational paper on CoT prompting cited as basis for reasoning LLM training
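
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. How this page actually computes its neighborhood is not stated, so everything here is an assumption: the function names `cosine` and `untyped_neighbors`, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are semantically close to
    `target` (cosine >= threshold) but share no typed edge with it,
    sorted by descending score.

    embeddings:  dict mapping entity name -> embedding vector (hypothetical layout)
    typed_edges: set of frozenset({a, b}) pairs that already have a typed relation
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # an entity is not its own neighbor
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
toy_embeddings = {
    "paper": [1.0, 0.0],
    "a": [1.0, 0.1],   # very close to "paper"
    "b": [0.0, 1.0],   # orthogonal, falls below the threshold
    "c": [1.0, 1.0],   # cosine about 0.707, just above the threshold
    "d": [0.9, 0.1],   # close, but already has a typed edge
}
toy_typed_edges = {frozenset(("paper", "d"))}

candidates = untyped_neighbors("paper", toy_embeddings, toy_typed_edges)
```

In this toy run, `candidates` holds `a` and `c` in that order: `b` is excluded by the threshold and `d` by its existing typed edge. Entities that pass both filters are the ones a curator would review as candidates for new edges or as possible duplicates.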