paper:boppana-goodfire-reasoning-theater-2026
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
TL;DR
Reasoning models generate chains of thought that are frequently performative rather than causally necessary for reaching the correct answer: on MMLU recall questions, activation probes decode the model's final answer from internal representations far earlier in the chain-of-thought than a CoT language monitor can, suggesting that many visible reasoning tokens are generated after the belief has already settled. Tested across DeepSeek-R1 671B and GPT-OSS 120B using three complementary methods (activation probing, early forced answering, and a CoT monitor), the analysis reveals a task-difficulty gradient: the gap between probe-decodable certainty and monitor-detectable certainty is large for easy MMLU items but narrows substantially for difficult GPQA-Diamond multihop questions, where genuine step-by-step uncertainty persists longer. The paper introduces probe-guided early exit, which terminates generation once probe confidence passes a threshold, cutting token counts by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy. A finding that complicates a purely dismissive view is that inflection-point behaviors (backtracking and 'aha' moments) appear almost exclusively in traces where probes register large belief shifts, suggesting that these surface signals track genuine uncertainty rather than being uniformly theatrical. The paper argues that this dissociation between internal belief and continued CoT generation is systematic and exploitable, positioning attention probing as both a diagnostic instrument for faithfulness research and a practical mechanism for adaptive computation.
What to take away
1. On easy MMLU recall questions, activation probes decode the model's final answer from intermediate representations significantly earlier in the chain-of-thought than a natural-language CoT monitor can detect it, indicating a systematic lag between internal belief fixation and visible reasoning.
2. Probe-guided early exit reduces token generation by up to 80% on MMLU and up to 30% on GPQA-Diamond while maintaining accuracy comparable to full chain-of-thought generation.
3. The performative CoT phenomenon is observed in two distinct large models, DeepSeek-R1 671B and GPT-OSS 120B, suggesting it is not an idiosyncrasy of a single training regime.
4. Genuine reasoning persists longer in difficult GPQA-Diamond multihop questions, where the gap between probe-decodable certainty and monitor-detectable certainty is substantially smaller than on MMLU, indicating a task-difficulty-specific dissociation.
5. Inflection-point behaviors, including backtracking and 'aha' moments, occur almost exclusively in responses where activation probes record large belief shifts, suggesting that these surface tokens track genuine uncertainty rather than being decorative theater.
6. The study triangulates internal belief against surface text using three methods simultaneously (activation probing, early forced answering, and a CoT monitor), so that each method's limitations can be cross-checked against the others.
7. To replicate the early forced answering methodology, a researcher can interrupt generation at fixed token intervals, append a forced-answer prompt, and compare the resulting answers against the full-generation answer to measure the point at which responses stabilize.
8. An open question is whether performative CoT is an artifact of RLHF-style training that incentivizes verbose reasoning traces, or whether it would emerge in any sufficiently capable model trained on next-token prediction without explicit length penalties.
9. The CoT monitor, a language-model-based classifier applied to the chain-of-thought text, consistently lags behind activation probes in detecting answer certainty, suggesting that surface linguistic content is a lossy and delayed signal of internal belief state.
10. The probe-guided early exit result positions attention probing not just as an interpretability tool but as a practical inference-efficiency mechanism: a token reduction of up to 80% on MMLU implies substantial compute savings at deployment scale for recall-dominated tasks.
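The early forced answering procedure described in item 7 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `force_answer` is a hypothetical callable standing in for whatever code truncates the chain-of-thought, appends a forced-answer prompt, runs the model, and parses the reply, and the "stable from this cut onward" criterion is one plausible reading of "the point at which responses stabilize".

```python
from typing import Callable, List, Optional, Sequence, Tuple


def forced_answer_trace(
    cot_tokens: Sequence[str],
    interval: int,
    force_answer: Callable[[Sequence[str]], str],
) -> List[Tuple[int, str]]:
    """Query a forced answer after every `interval` tokens of the CoT.

    `force_answer` is a hypothetical stand-in: given a CoT prefix, it should
    append a forced-answer prompt, run the model, and return the parsed answer.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    cuts = list(range(interval, len(cot_tokens) + 1, interval))
    if not cuts or cuts[-1] != len(cot_tokens):
        cuts.append(len(cot_tokens))  # always include the full chain
    return [(cut, force_answer(cot_tokens[:cut])) for cut in cuts]


def stabilization_point(
    trace: Sequence[Tuple[int, str]], final_answer: str
) -> Optional[int]:
    """Earliest cut from which every forced answer equals `final_answer`.

    Returns None if the forced answer never settles on `final_answer`.
    """
    stable_from: Optional[int] = None
    for cut, answer in trace:
        if answer != final_answer:
            stable_from = None  # a later disagreement resets the streak
        elif stable_from is None:
            stable_from = cut
    return stable_from


if __name__ == "__main__":
    # Toy stand-in for the model: commits to "C" once it has seen six tokens.
    def toy_force_answer(prefix: Sequence[str]) -> str:
        return "C" if len(prefix) >= 6 else "A"

    tokens = [f"t{i}" for i in range(10)]
    trace = forced_answer_trace(tokens, interval=4, force_answer=toy_force_answer)
    print(trace, stabilization_point(trace, "C"))
```

Requiring agreement from the cut onward, rather than at the first matching cut, avoids counting a lucky early match that the model later abandons.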
Peer brief — for seminar discussion
This paper tackles the question of whether reasoning model chain-of-thought is causally necessary for reaching correct answers or whether it is frequently generated after the model has already settled on a belief — a phenomenon termed 'reasoning theater.' The experimental design probes this by applying three complementary instruments to DeepSeek-R1 671B and GPT-OSS 120B on two benchmarks with contrasting difficulty profiles: MMLU, used as a source of easy recall questions, and GPQA-Diamond, used for hard multihop questions requiring genuine multi-step reasoning. The three instruments are activation probing (reading answer-predictive signals from intermediate attention layers), early forced answering (interrupting generation and forcing an answer at fixed intervals), and a CoT monitor (a language-model classifier reading the surface chain-of-thought text). The introduced method is probe-guided early exit, which halts generation once probe confidence exceeds a threshold, yielding up to 80% token reduction on MMLU and 30% on GPQA-Diamond at comparable accuracy. The load-bearing finding is a clean task-difficulty dissociation: on MMLU, probes decode the final answer far earlier in the chain-of-thought than the CoT monitor can, indicating that continued token generation is performative; on GPQA-Diamond this gap narrows, consistent with genuine incremental belief formation. An important moderating result is that backtracking and 'aha' inflection points in the surface text co-occur strongly with large belief shifts detected by probes, meaning these surface behaviors are not uniformly theatrical — they track real epistemic transitions. The paper's implicit prediction is that token-efficiency gains from early exit will scale with the proportion of easy queries in a deployment workload, and that interpretability via probing generalizes across the two tested architectures. 
An alternative approach that could have been used is causal intervention (e.g., activation patching or steering vectors) to directly test whether the post-certainty tokens causally influence the final output, rather than only correlating probe state with answer stability. A critical reader would push back on the operationalization of 'easy' versus 'hard': the paper uses MMLU and GPQA-Diamond as proxies for this distinction, but the behavioral difference could reflect surface features of the datasets — answer format, question length, expected response structure — rather than genuine difficulty, making it unclear whether the probe-certainty gap generalizes to other task types or is an artifact of benchmark-specific training signal. The scope is also limited to two models from a narrow slice of the publicly available reasoning model landscape, leaving open how results transfer to smaller distilled reasoning models or to models trained with different CoT-length reward shaping.
Methods (3)
- CoT Monitor: Named method for monitoring chain-of-thought text to detect when the model signals its answer, compared against activation probes.
- Early Forced Answering: Named evaluation protocol that truncates the CoT at various points and forces the model to give a final answer, to measure when the answer stabilizes.
- Probe-based early-exit: Strategy introduced in the paper to stop generation early based on probe confidence, saving tokens while retaining accuracy.
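The probe-based early-exit strategy listed above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the source only states that generation stops once probe confidence passes a threshold, so the checkpoint format, the default `threshold` of 0.9, and the `ExitDecision` record are hypothetical, and the probe outputs are taken as given rather than computed from activations.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class ExitDecision:
    tokens_used: int
    answer: str
    exited_early: bool
    tokens_saved_fraction: float


def probe_guided_exit(
    checkpoints: Sequence[Tuple[int, Dict[str, float]]],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> ExitDecision:
    """Stop at the first checkpoint where the probe's top answer is confident.

    Each checkpoint is (tokens_generated_so_far, probe distribution over
    answer options). The last checkpoint is treated as the full CoT length.
    """
    if not checkpoints:
        raise ValueError("need at least one probe checkpoint")
    total_tokens = checkpoints[-1][0]
    if total_tokens <= 0:
        raise ValueError("final checkpoint must have a positive token count")

    for tokens_so_far, probs in checkpoints:
        answer = max(probs, key=probs.get)
        if probs[answer] >= threshold:
            saved = 1.0 - tokens_so_far / total_tokens
            return ExitDecision(
                tokens_so_far, answer, tokens_so_far < total_tokens, saved
            )

    # Threshold never reached: keep the full chain and the final probe answer.
    _, last_probs = checkpoints[-1]
    return ExitDecision(total_tokens, max(last_probs, key=last_probs.get), False, 0.0)


if __name__ == "__main__":
    toy = [
        (100, {"A": 0.40, "B": 0.60}),
        (200, {"A": 0.05, "B": 0.95}),
        (1000, {"A": 0.01, "B": 0.99}),
    ]
    print(probe_guided_exit(toy))
```

In this framing the threshold trades tokens for risk: a lower value exits sooner but is more likely to commit before the belief has actually settled, which matters more on hard questions where belief shifts arrive late.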
Frameworks (1)
- Reasoning Theater Framework: The conceptual framework introduced by the paper, distinguishing performative CoT from genuine reasoning using activation probing.
Datasets (2)
- GPQA-Diamond: Benchmark used to evaluate performative reasoning; shows less performative reasoning than MMLU (harder task).
- MMLU: Benchmark used to evaluate performative reasoning; shows significantly more performative reasoning than GPQA-Diamond (easier task).
Findings (5)
- Inflection points (backtracking, 'aha' moments) occur almost exclusively in CoT responses where probes show large belief shifts, across DeepSeek-R1 671B and GPT-OSS 120B
Empirical finding linking textual CoT behaviors to internal belief dynamics
- Model final answer is decodable from activations far earlier in CoT than CoT monitor detects on MMLU recall-based questions for both DeepSeek-R1 671B and GPT-OSS 120B
Core empirical result demonstrating early belief formation in easy tasks
- On GPQA-Diamond multihop questions, activation probes show genuine belief shifts during CoT generation rather than early stabilization, contrasting with MMLU
Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
- Activation probing detects final answer belief earlier in CoT than CoT monitor on both models, with especially pronounced gap on easy MMLU questions
Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection
- Probe-guided early exit reduces tokens by up to 30% on GPQA-Diamond with similar accuracy on DeepSeek-R1 671B and GPT-OSS 120B
Quantitative efficiency result on hard benchmark, smaller reduction reflecting genuine reasoning need
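One way to make the probe-versus-monitor gap in these findings concrete is to compare the point in the chain-of-thought at which each signal first settles on the final answer. The sketch below is an assumed operationalization, not the paper's metric: it takes per-checkpoint predictions as given (with `None` meaning the monitor sees no answer in the text yet) and reports the difference as a fraction of the CoT.

```python
from typing import Optional, Sequence


def first_stable_fraction(
    predictions: Sequence[Optional[str]], final_answer: str
) -> Optional[float]:
    """Fraction of the CoT elapsed when predictions settle on `final_answer`.

    "Settle" means every prediction from that checkpoint onward matches.
    Returns None if the predictions never settle on the final answer.
    """
    stable_from: Optional[int] = None
    for i, pred in enumerate(predictions):
        if pred != final_answer:
            stable_from = None
        elif stable_from is None:
            stable_from = i
    if stable_from is None:
        return None
    return (stable_from + 1) / len(predictions)


def detection_gap(
    probe_preds: Sequence[Optional[str]],
    monitor_preds: Sequence[Optional[str]],
    final_answer: str,
) -> Optional[float]:
    """Monitor settle point minus probe settle point, in CoT fractions.

    A large positive value means the probe knew the answer long before the
    text revealed it, which is the signature of performative CoT.
    """
    probe_at = first_stable_fraction(probe_preds, final_answer)
    monitor_at = first_stable_fraction(monitor_preds, final_answer)
    if probe_at is None or monitor_at is None:
        return None
    return monitor_at - probe_at


if __name__ == "__main__":
    probe = ["B", "C", "C", "C", "C"]
    monitor = [None, None, None, "C", "C"]
    print(detection_gap(probe, monitor, "C"))
```

Under this reading, easy recall items would be expected to show large positive gaps and hard multihop items smaller ones, matching the direction of the reported contrast.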
Claims (5)
- Inflection points such as backtracking and 'aha' moments occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned reasoning theater
Interpretive claim linking observable CoT behaviors to genuine internal uncertainty shifts
- Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, difficult multihop questions show genuine reasoning
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
- A model's final answer is decodable from activations far earlier in CoT than a CoT monitor can detect, especially for easy recall-based MMLU questions
Key comparative finding showing activation probes outperform text-level monitors for early answer detection
- Reasoning models generate performative CoT tokens after achieving strong confidence in their final answer without revealing this belief in text
The central empirical claim of the paper, supported by activation probing evidence
- Probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy
Practical efficiency claim for using activation probes to enable adaptive computation
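The claim that inflection points co-occur with large belief shifts can be checked with a simple contingency count. The sketch below is a rough illustration under loud assumptions: the marker phrases, the use of total variation distance between consecutive probe distributions as the "belief shift", and the `shift_threshold` of 0.5 are all hypothetical choices, since the source does not specify how either quantity is measured.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical marker phrases; crude substring matching, for illustration only.
INFLECTION_MARKERS = ("wait", "actually", "on second thought", "aha")

Dist = Dict[str, float]


def total_variation(p: Dist, q: Dist) -> float:
    """Total variation distance between two distributions over answers."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_belief_shift(probe_dists: Sequence[Dist]) -> float:
    """Largest change between consecutive probe distributions in one trace."""
    pairs = zip(probe_dists, probe_dists[1:])
    return max((total_variation(a, b) for a, b in pairs), default=0.0)


def has_inflection(cot_text: str, markers: Sequence[str] = INFLECTION_MARKERS) -> bool:
    text = cot_text.lower()
    return any(marker in text for marker in markers)


def inflection_shift_table(
    traces: Sequence[Tuple[str, Sequence[Dist]]],
    shift_threshold: float = 0.5,  # assumed cutoff for a "large" shift
) -> Dict[Tuple[bool, bool], int]:
    """Count traces by (has inflection marker, has large belief shift)."""
    counts = {(inf, big): 0 for inf in (True, False) for big in (True, False)}
    for text, dists in traces:
        key = (has_inflection(text), max_belief_shift(dists) >= shift_threshold)
        counts[key] += 1
    return counts


if __name__ == "__main__":
    toy_traces = [
        ("Wait, actually it is B.", [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]),
        ("The capital is Paris.", [{"A": 0.95, "B": 0.05}, {"A": 0.97, "B": 0.03}]),
    ]
    print(inflection_shift_table(toy_traces))
```

If the claim holds, the (True, False) cell, meaning inflection markers without a large belief shift, should be close to empty.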
Hypotheses (1)
- Attention probing can serve as an efficient tool for detecting performative reasoning and enabling adaptive computation in reasoning models
Forward-looking hypothesis positioned as a conclusion and future direction of the paper
Questions (4)
- Can activation probing enable efficient adaptive computation by detecting when a model's belief has stabilized during CoT generation?
Practical question addressed by the probe-guided early exit experiments
- Does chain-of-thought text faithfully reveal a model's internal reasoning process, or does it constitute performative theater?
Central research question motivating the paper
- Do inflection points like backtracking and 'aha' moments in CoT reflect genuine belief changes or learned stylistic patterns?
Question addressed by the correlation between inflection points and probe-detected belief shifts
- Under what conditions does chain-of-thought reflect genuine uncertainty resolution versus a learned performance?
Key question addressed by the task difficulty analysis comparing MMLU and GPQA-Diamond
Original abstract
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.