question

active

question:under-what-conditions-does-chain-of-thought-reflect-genuine-uncertainty-resolution-versus-a-learned-performance

Under what conditions does chain-of-thought reflect genuine uncertainty resolution versus a learned performance?

Key question addressed by the task-difficulty analysis comparing MMLU and GPQA-Diamond.

Source paper

extracted_from

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

(2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4

Neighborhood — ranked by edge-count

Claims (1)

claim

Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, while difficult multi-hop questions show genuine reasoning
gates
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Does chain-of-thought text faithfully reveal a model's internal reasoning process, or does it constitute performative theater? · question · 0.815
Central research question motivating the paper
A small number of high-quality human demonstrations of chain-of-thought reasoning could be used to improve and focus performance. · hypothesis · 0.810
Section 6 mentions high-quality human demos could improve natural language feedback.
Chain-of-Thought Reasoning · concept · 0.801
Medium through which eval awareness is often verbalized; target of intervention.
When LLMs produce experience claims under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference? · question · 0.787
The core interpretive question the paper narrows but cannot definitively answer
Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023) · concept · 0.786
Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
Performative chain-of-thought is real; verbalized output does not equal internal state. · claim · 0.784
Chain-of-thought reasoning improves large model accuracy on HHH binary comparisons, reaching ~78% for 52B model, competitive with human-feedback PM. · finding · 0.776
Figure 4 shows CoT improves over zero-shot, and ensembled CoT further boosts accuracy.
Chain-of-thought reasoning improves the transparency and performance of AI decision making in harmlessness evaluation. · claim · 0.776
CoT improves accuracy on HHH evals and makes the decision process legible.