finding

active

finding:model-final-answer-is-decodable-from-activations-far-earlier-in-cot-than-cot-monitor-detects-on-mmlu-recall-based-questions-for-both-deepseek-r1-671b-and-gpt-oss-120b

Model final answer is decodable from activations far earlier in CoT than CoT monitor detects on MMLU recall-based questions for both DeepSeek-R1 671B and GPT-OSS 120B

Core empirical result demonstrating early belief formation in easy tasks

Source paper

extracted_from

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

(2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4

Neighborhood — ranked by edge-count

Claims (2)

claim

A model's final answer is decodable from activations far earlier in CoT than a CoT monitor can detect, especially for easy recall-based MMLU questions
restates · supports
Key comparative finding showing activation probes outperform text-level monitors for early answer detection
Reasoning models generate performative CoT tokens after achieving strong confidence in their final answer without revealing this belief in text
supports
The central empirical claim of the paper, supported by activation probing evidence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation probing detects final answer belief earlier in CoT than CoT monitor on both models, with especially pronounced gap on easy MMLU questions · finding · 0.848
Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection
On GPQA-Diamond multihop questions, activation probes show genuine belief shifts during CoT generation rather than early stabilization, contrasting with MMLU · finding · 0.774
Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, difficult multihop questions show genuine reasoning · claim · 0.754
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
DeepSeek-R1 Llama 8b accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention -0.96 · finding · 0.753
Shows smaller models are more sensitive to reflection reduction on non-math tasks
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent. · finding · 0.748
Key improvement in cross-task generalization enabled by explicit instruction framing.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. · finding · 0.744
Demonstrates that early-layer probes capture sentence polarity rather than truth.
No-prompt probes show significant AUROC performance drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt. · finding · 0.743
Generalization evidence that truth probes are not invariant to model instructions.
Probe-guided early exit reduces tokens by up to 30% on GPQA-Diamond with similar accuracy on DeepSeek-R1 671B and GPT-OSS 120B · finding · 0.743
Quantitative efficiency result on hard benchmark, smaller reduction reflecting genuine reasoning need
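
The selection rule for the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration only: the function names, the embedding dictionary, and the edge-list layout are hypothetical, and nothing here is confirmed to match how this page is actually generated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge
    to target_id, sorted by descending score.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: iterable of (source_id, destination_id) pairs (hypothetical layout)
    """
    # Entities that already share a typed edge with the target, in either direction.
    linked = {dst for src, dst in typed_edges if src == target_id}
    linked |= {src for src, dst in typed_edges if dst == target_id}

    target_vec = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Under the same assumptions, raising `threshold` to 0.90 would give the stricter cutoff used in the "Restated by" section; whether that section also excludes entities with typed edges is not stated here.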

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
A model's final answer is decodable from activations far earlier in CoT than a CoT monitor can detect, especially for easy recall-based MMLU questions