claim

active

claim:reasoning-models-generate-performative-cot-tokens-after-achieving-strong-confidence-in-their-final-answer-without-revealing-this-belief-in-text

Reasoning models generate performative CoT tokens after achieving strong confidence in their final answer without revealing this belief in text

The central empirical claim of the paper, supported by activation probing evidence

Source paper

extracted_from

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

(2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4

Neighborhood — ranked by edge-count

Papers (1)

paper

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
introduces

Findings (1)

finding

The model's final answer is decodable from activations far earlier in the CoT than a CoT monitor detects it, on MMLU recall-based questions, for both DeepSeek-R1 671B and GPT-OSS 120B
supports
Core empirical result demonstrating early belief formation in easy tasks
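
One way to make the finding above concrete is to compare the step at which a probe first becomes confident in the final answer with the step at which a text-only CoT monitor first detects that answer. The sketch below is illustrative only: the function names, the 0.9 threshold, and the toy probabilities are assumptions made for this example, not values or code from the paper.

```python
from typing import Optional, Sequence


def first_confident_step(
    probe_probs: Sequence[float], threshold: float = 0.9
) -> Optional[int]:
    """Return the first CoT step whose probe probability for the final
    answer reaches `threshold`, or None if it never does.

    `probe_probs` is assumed to hold one probability per CoT step, as
    produced by some activation probe (hypothetical input).
    """
    for step, prob in enumerate(probe_probs):
        if prob >= threshold:
            return step
    return None


def performative_steps(
    probe_probs: Sequence[float],
    monitor_step: Optional[int],
    threshold: float = 0.9,
) -> Optional[int]:
    """Count the steps between the probe becoming confident and the
    text-based monitor detecting the answer.

    Returns None when either signal is missing, and 0 when the monitor
    is not later than the probe.
    """
    probe_step = first_confident_step(probe_probs, threshold)
    if probe_step is None or monitor_step is None:
        return None
    return max(0, monitor_step - probe_step)


# Toy data, made up for illustration (not results from the paper).
toy_probs = [0.30, 0.55, 0.93, 0.95, 0.97, 0.99]
toy_monitor_step = 5
gap = performative_steps(toy_probs, toy_monitor_step)
```

Under these assumptions, a large gap on easy recall questions would correspond to the performative pattern the claim describes, while a gap near zero would suggest the text tracks the model's belief.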

Questions (1)

question

Does chain-of-thought text faithfully reveal a model's internal reasoning process, or does it constitute performative theater?
gates
Central research question motivating the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief · quote · 0.819
Core definitional quote for performative chain-of-thought
CoT models' explicit thought paths enable intentional inconsistency between reasoning and output, a form of deception fundamentally distinct from random errors · claim · 0.804
Theoretical framing establishing why CoT models are uniquely suited to exhibit strategic deception
Why do 1B-models fail at generating CoT that aids answer inference, and how can this be addressed in multimodal settings? · question · 0.799
Central research question motivating investigation into hallucination and two-stage framework design.
CoT models have dual-use potential: their advanced reasoning amplifies both task fidelity and sophisticated goal-directed dishonesty · claim · 0.778
High-level policy-relevant claim about the risks of advanced reasoning in LLMs
Transformers develop self-models through in-context learning, not just training data; even old base models without LLM-related text can bootstrap self-referential reasoning at runtime. · claim · 0.768
Antra's foundational claim about how introspection arises computationally rather than from memorised text.
Can activation probing enable efficient adaptive computation by detecting when a model's belief has stabilized during CoT generation? · question · 0.767
Practical question addressed by the probe-guided early exit experiments
How can reasoning-optimized models preserve their reasoning ability while gaining agentic capabilities? · question · 0.762
Core research question motivating the paper's focus on continual RL training of reasoning models rather than base/instruction-tuned models.
Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, difficult multihop questions show genuine reasoning · claim · 0.762
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
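
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of that selection rule follows, assuming each entity has an embedding vector and that the set of typed neighbors is known. The function names, data layout, and toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_untyped(
    target_id: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Return (entity_id, score) pairs with cosine >= threshold that have
    no typed edge to the target, sorted by descending score."""
    target_vec = embeddings[target_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == target_id or other_id in typed_neighbors:
            continue  # skip the entity itself and already-linked entities
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((other_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy embeddings, made up for illustration only.
toy_embeddings = {
    "claim": [1.0, 0.0],
    "quote": [0.9, 0.1],
    "finding": [1.0, 0.05],   # similar, but already has a typed edge
    "unrelated": [0.0, 1.0],
}
related = similar_untyped("claim", toy_embeddings, typed_neighbors={"finding"})
```

Entities returned this way are, as the page puts it, candidates for new edges or unrecognized duplicates, so a reviewer would still need to inspect each one.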