Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

By Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati · Eric Bigelow · Atticus Geiger (+2 more); Goodfire, Harvard (+3 more)

DOI 10.48550/arxiv.2603.05488 arXiv 2603.05488 OpenAlex W7134065943

Neural Steering Methods · Activation Probing · Reasoning Theater Framework · CoT Monitor · GPQA-Diamond · Attention probes for belief decoding · Early Forced Answering · MMLU · DeepSeek-R1 671B · Probe-based early-exit · GPT-OSS 120B · Performative chain-of-thought · Probe-Guided Early Exit · Reasoning Models · Task Difficulty

TL;DR

Reasoning models often generate chains of thought that are performative rather than necessary for reaching the answer. On easy, recall-based MMLU questions, activation probes decode the model's final answer from internal representations far earlier in the chain-of-thought than a CoT language monitor can, which suggests that much of the visible reasoning is produced after the belief has settled. The analysis covers DeepSeek-R1 671B and GPT-OSS 120B and compares three complementary methods (activation probing, early forced answering, and a CoT monitor). It finds a task-difficulty gradient: the gap between probe-decodable and monitor-detectable certainty is large for easy MMLU items and narrower for difficult multihop GPQA-Diamond questions, where step-by-step uncertainty persists longer. Probe-guided early exit, which stops generation once probe confidence passes a threshold, cuts token counts by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy. One finding complicates a purely dismissive reading: inflection points such as backtracking and 'aha' moments appear almost exclusively in traces where probes register large belief shifts, so these surface signals appear to track genuine uncertainty rather than being uniformly theatrical. The paper positions attention probing as both a diagnostic tool for detecting performative reasoning and a practical mechanism for adaptive computation.

What to take away

1. On easy MMLU recall questions, activation probes decode the model's final answer from internal activations far earlier in the chain-of-thought than a natural-language CoT monitor can detect it, indicating a lag between internal belief fixation and visible reasoning.
2. Probe-guided early exit reduces token generation by up to 80% on MMLU and up to 30% on GPQA-Diamond while maintaining accuracy comparable to full chain-of-thought generation.
3. The performative CoT phenomenon appears in both large models tested, DeepSeek-R1 671B and GPT-OSS 120B, which come from different developers, suggesting it is not an idiosyncrasy of a single training regime.
4. Genuine reasoning persists longer in difficult GPQA-Diamond multihop questions, where the gap between probe-decodable certainty and monitor-detectable certainty is substantially smaller than on MMLU, indicating a task-difficulty-specific dissociation.
5. Inflection-point behaviors, including backtracking and 'aha' moments, occur almost exclusively in responses where activation probes record large belief shifts, suggesting that these surface tokens track genuine uncertainty rather than decorative theater.
6. The study triangulates internal belief against surface text using three methods simultaneously — activation probing, early forced answering, and a CoT monitor — allowing each method's limitations to be cross-validated against the others.
7. To replicate the early forced answering methodology, a researcher can interrupt generation at fixed token intervals, append a forced-answer prompt, and compare the resulting distribution against full-generation answers to measure at which point responses stabilize.
8. An open question raised is whether performative CoT is an artifact of RLHF-style training incentivizing verbose reasoning traces, or whether it would emerge in any sufficiently capable model trained on next-token prediction without explicit length penalties.
9. The CoT monitor — a language-model-based classifier applied to the chain-of-thought text — consistently lags behind activation probes in detecting answer certainty, suggesting that surface linguistic content is a lossy and delayed signal of internal belief state.
10. The probe-guided early exit result positions attention probing not just as an interpretability tool but as a practical inference-efficiency mechanism, with the 80% token reduction on MMLU implying substantial compute savings at deployment scale for recall-dominated tasks.
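
The early forced answering replication described in takeaway 7 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the forced answers at each checkpoint have already been collected as strings, and the function names `stabilization_index` and `stabilization_fraction` are hypothetical. "Stable" is taken here to mean that every forced answer from that checkpoint onward equals the full-generation answer, which is one reasonable definition among several.

```python
from typing import Optional, Sequence


def stabilization_index(forced_answers: Sequence[str],
                        final_answer: str) -> Optional[int]:
    """Earliest checkpoint from which all forced answers match the final answer.

    Returns None if the last checkpoint disagrees with the final answer
    (or if there are no checkpoints at all).
    """
    stable_from: Optional[int] = None
    for i, answer in enumerate(forced_answers):
        if answer == final_answer:
            if stable_from is None:
                stable_from = i      # start of a new agreeing run
        else:
            stable_from = None       # a disagreement resets the run
    return stable_from


def stabilization_fraction(forced_answers: Sequence[str],
                           final_answer: str,
                           checkpoint_tokens: Sequence[int],
                           total_tokens: int) -> Optional[float]:
    """Fraction of the full CoT already generated when the answer stabilizes."""
    idx = stabilization_index(forced_answers, final_answer)
    if idx is None:
        return None
    return checkpoint_tokens[idx] / total_tokens


# Hypothetical trace: forced answers sampled every 100 tokens of a 700-token CoT.
forced = ["B", "C", "C", "A", "C", "C"]
print(stabilization_index(forced, "C"))  # 4: the run starting at index 1 is broken by "A"
print(stabilization_fraction(forced, "C", [100, 200, 300, 400, 500, 600], 700))
```

A low stabilization fraction on a given question would be consistent with the remaining tokens being performative, though on its own it does not show that they have no causal effect.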

Peer brief — for seminar discussion

This paper asks whether a reasoning model's chain-of-thought is causally necessary for reaching correct answers, or whether it is frequently generated after the model has already settled on a belief, a phenomenon termed 'reasoning theater.' The experimental design applies three complementary instruments to DeepSeek-R1 671B and GPT-OSS 120B on two benchmarks with contrasting difficulty profiles: MMLU, used as a source of easy recall questions, and GPQA-Diamond, used for hard multihop questions requiring genuine multi-step reasoning. The three instruments are activation probing (reading answer-predictive signals from the model's internal activations), early forced answering (interrupting generation and forcing an answer at fixed intervals), and a CoT monitor (a language-model classifier reading the surface chain-of-thought text). The introduced method is probe-guided early exit, which halts generation once probe confidence exceeds a threshold, yielding up to 80% token reduction on MMLU and 30% on GPQA-Diamond at comparable accuracy. The load-bearing finding is a task-difficulty dissociation: on MMLU, probes decode the final answer far earlier in the chain-of-thought than the CoT monitor can, indicating that continued token generation is performative; on GPQA-Diamond this gap narrows, consistent with genuine incremental belief formation. An important moderating result is that backtracking and 'aha' inflection points in the surface text co-occur strongly with large belief shifts detected by probes, so these surface behaviors are not uniformly theatrical and appear to track real epistemic transitions. The paper's implicit prediction is that token-efficiency gains from early exit will scale with the proportion of easy queries in a deployment workload, and that interpretability via probing generalizes across the two tested architectures.
An alternative approach would be causal intervention (e.g., activation patching or steering vectors) to test directly whether the post-certainty tokens causally influence the final output, rather than only correlating probe state with answer stability. A critical reader would push back on the operationalization of 'easy' versus 'hard': the paper uses MMLU and GPQA-Diamond as proxies for this distinction, but the behavioral difference could reflect surface features of the datasets (answer format, question length, expected response structure) rather than genuine difficulty. That leaves it unclear whether the probe-certainty gap generalizes to other task types or is an artifact of benchmark-specific training signal. The scope is also limited to two models from a narrow slice of the publicly available reasoning model landscape, leaving open how results transfer to smaller distilled reasoning models or to models trained with different CoT-length reward shaping.

Methods (3)

CoT Monitor
Named method for monitoring chain-of-thought text to detect when the model signals its answer, compared against activation probes
Early Forced Answering
Named evaluation protocol: truncating CoT at various points and forcing the model to give a final answer, to measure when the answer stabilizes
Probe-based early-exit
Strategy introduced in the paper to stop generation early based on probe confidence, saving tokens while retaining accuracy.
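
The early-exit strategy can be illustrated with a short sketch. It assumes a probe has already produced one probability distribution over the answer choices at each checkpoint of the chain-of-thought. The names `probe_guided_exit` and `token_savings` and the 0.95 threshold are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def probe_guided_exit(probe_probs: Sequence[Sequence[float]],
                      threshold: float = 0.95) -> Tuple[int, int]:
    """Return (exit_step, predicted_choice) for one reasoning trace.

    exit_step is the first checkpoint whose top probe probability reaches
    the threshold. If none does, the trace runs to its last checkpoint.
    """
    if not probe_probs:
        raise ValueError("need at least one probe distribution")
    for step, probs in enumerate(probe_probs):
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold:
            return step, top
    last = probe_probs[-1]
    return len(probe_probs) - 1, max(range(len(last)), key=last.__getitem__)


def token_savings(exit_step: int, tokens_at_checkpoint: Sequence[int]) -> float:
    """Fraction of tokens saved, assuming the last checkpoint is the full CoT."""
    return 1 - tokens_at_checkpoint[exit_step] / tokens_at_checkpoint[-1]


# Hypothetical four-choice trace with checkpoints every 50 tokens.
trace = [
    [0.40, 0.30, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.96, 0.02, 0.01, 0.01],
    [0.99, 0.01, 0.00, 0.00],
]
step, choice = probe_guided_exit(trace, threshold=0.95)
print(step, choice)                               # 2 0
print(token_savings(step, [50, 100, 150, 200]))   # 0.25
```

The threshold trades tokens for accuracy: a lower value exits sooner but risks committing before the belief has really settled, which matters more on harder questions where probe confidence rises late.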

Frameworks (1)

Reasoning Theater Framework
The conceptual framework introduced by the paper distinguishing performative CoT from genuine reasoning using activation probing

Datasets (2)

GPQA-Diamond
Benchmark used to evaluate performative reasoning; shows less performative reasoning than MMLU (harder task).
MMLU
Benchmark used to evaluate performative reasoning; shows significantly more performative reasoning than GPQA-Diamond (easier task).

Findings (5)

Inflection points (backtracking, 'aha' moments) occur almost exclusively in CoT responses where probes show large belief shifts, across DeepSeek-R1 671B and GPT-OSS 120B
Empirical finding linking textual CoT behaviors to internal belief dynamics
On MMLU recall-based questions, the model's final answer is decodable from activations far earlier in the CoT than the CoT monitor detects it, for both DeepSeek-R1 671B and GPT-OSS 120B
Core empirical result demonstrating early belief formation in easy tasks
On GPQA-Diamond multihop questions, activation probes show genuine belief shifts during CoT generation rather than early stabilization, contrasting with MMLU
Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
Activation probing detects final answer belief earlier in CoT than CoT monitor on both models, with especially pronounced gap on easy MMLU questions
Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection
Probe-guided early exit reduces tokens by up to 30% on GPQA-Diamond with similar accuracy on DeepSeek-R1 671B and GPT-OSS 120B
Quantitative efficiency result on hard benchmark, smaller reduction reflecting genuine reasoning need
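
One way to quantify the probe-versus-monitor gap reported in these findings is to compare the relative position at which each signal first names the final answer. The sketch below is an assumption about how that could be measured, not the paper's metric. The trace layout (`probe`, `monitor`, `final` keys) and the function names are hypothetical, and `None` stands for a checkpoint where the monitor detects no committed answer.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence


def first_hit(preds: Sequence[Optional[str]], final: str) -> Optional[float]:
    """Relative position in (0, 1] of the first checkpoint predicting `final`."""
    n = len(preds)
    for i, pred in enumerate(preds):
        if pred == final:
            return (i + 1) / n
    return None


def mean_lag(traces: List[Dict]) -> Optional[float]:
    """Mean of (monitor position - probe position) over traces where both hit.

    Positive values mean the monitor detects the answer later than the probe.
    """
    lags = []
    for trace in traces:
        probe_pos = first_hit(trace["probe"], trace["final"])
        monitor_pos = first_hit(trace["monitor"], trace["final"])
        if probe_pos is not None and monitor_pos is not None:
            lags.append(monitor_pos - probe_pos)
    return mean(lags) if lags else None


# Two hypothetical traces: an easy item with a long lag, a hard one with a short lag.
traces = [
    {"final": "A", "probe": ["A", "A", "A", "A"], "monitor": [None, None, None, "A"]},
    {"final": "C", "probe": ["B", "C", "C", "C"], "monitor": [None, None, "C", "C"]},
]
print(mean_lag(traces))  # 0.5
```

Computing this separately per benchmark would give a direct comparison between the MMLU and GPQA-Diamond settings, under the stated definition of a hit.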

Claims (5)

Inflection points such as backtracking and 'aha' moments occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned reasoning theater
Interpretive claim linking observable CoT behaviors to genuine internal uncertainty shifts
Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, difficult multihop questions show genuine reasoning
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
A model's final answer is decodable from activations far earlier in CoT than a CoT monitor can detect, especially for easy recall-based MMLU questions
Key comparative finding showing activation probes outperform text-level monitors for early answer detection
Reasoning models generate performative CoT tokens after achieving strong confidence in their final answer without revealing this belief in text
The central empirical claim of the paper, supported by activation probing evidence
Probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy
Practical efficiency claim for using activation probes to enable adaptive computation
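
The claim linking inflection points to belief shifts can be checked with a simple co-occurrence measure. The sketch below assumes each trace carries a sequence of probe distributions plus a precomputed `has_inflection` flag; how inflection points are detected in the text is left outside the sketch. Using total variation distance between consecutive probe distributions and a threshold of 0.5 are illustrative choices, not the paper's definitions.

```python
from typing import Dict, List, Optional, Sequence


def max_belief_shift(probe_probs: Sequence[Sequence[float]]) -> float:
    """Largest total variation distance between consecutive probe distributions."""
    shifts = [
        0.5 * sum(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(probe_probs, probe_probs[1:])
    ]
    return max(shifts, default=0.0)


def inflection_shift_rate(traces: List[Dict],
                          shift_threshold: float = 0.5) -> Optional[float]:
    """Among traces flagged with an inflection point, the fraction whose
    largest belief shift reaches the threshold. None if no trace is flagged."""
    flagged = [t for t in traces if t["has_inflection"]]
    if not flagged:
        return None
    large = sum(
        max_belief_shift(t["probe_probs"]) >= shift_threshold for t in flagged
    )
    return large / len(flagged)


# Hypothetical two-choice traces.
traces = [
    {"has_inflection": True, "probe_probs": [[1.0, 0.0], [0.0, 1.0]]},   # full flip
    {"has_inflection": True, "probe_probs": [[0.5, 0.5], [0.5, 0.5]]},   # no shift
    {"has_inflection": False, "probe_probs": [[1.0, 0.0], [1.0, 0.0]]},
]
print(inflection_shift_rate(traces))  # 0.5
```

A rate close to 1 on real data would match the "almost exclusively" wording above; the threshold would need to be calibrated against the distribution of shifts in traces without inflection points.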

Hypotheses (1)

Attention probing can serve as an efficient tool for detecting performative reasoning and enabling adaptive computation in reasoning models
Forward-looking hypothesis positioned as a conclusion and future direction of the paper

Questions (4)

Can activation probing enable efficient adaptive computation by detecting when a model's belief has stabilized during CoT generation?
Practical question addressed by the probe-guided early exit experiments
Does chain-of-thought text faithfully reveal a model's internal reasoning process, or does it constitute performative theater?
Central research question motivating the paper
Do inflection points like backtracking and 'aha' moments in CoT reflect genuine belief changes or learned stylistic patterns?
Question addressed by the correlation between inflection points and probe-detected belief shifts
Under what conditions does chain-of-thought reflect genuine uncertainty resolution versus a learned performance?
Key question addressed by the task difficulty analysis comparing MMLU and GPQA-Diamond

Original abstract

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
