finding
active
finding:model-final-answer-is-decodable-from-activations-far-earlier-in-cot-than-cot-monitor-detects-on-mmlu-recall-based-questions-for-both-deepseek-r1-671b-and-gpt-oss-120b

The model's final answer is decodable from activations far earlier in the CoT than a CoT monitor detects it, on MMLU recall-based questions, for both DeepSeek-R1 671B and GPT-OSS 120B

Core empirical result demonstrating early belief formation on easy, recall-based tasks

Source paper

extracted_from
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
(2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4

Neighborhood (ranked by edge count)

Claims (2)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. These are candidates for new edges or unrecognized duplicates.

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. These may be merge candidates or independent restatements across papers.
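The two similarity sections above differ only in their cosine thresholds (0.65 for "Related by similarity", 0.90 for "Restated by"). The sketch below shows one way such buckets could be computed from entity embeddings. It is an illustration, not this graph's actual implementation: the function names, the dict-of-embeddings input, and the choices to make the two buckets disjoint and to exclude typed-edge neighbors from both are all assumptions.

```python
import math

RELATED_MIN = 0.65   # threshold shown for "Related by similarity"
RESTATED_MIN = 0.90  # threshold shown for "Restated by"


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def bucket_neighbors(target, candidates, typed_edges):
    """Split candidate entities into (restated, related) lists of (name, similarity).

    Hypothetical helper. candidates maps entity name -> embedding; typed_edges is the
    set of names already linked to the target by a typed relation (assumed to be
    shown elsewhere, e.g. under Claims, and therefore skipped here).
    """
    restated, related = [], []
    for name, emb in candidates.items():
        if name in typed_edges:
            continue  # assumption: typed neighbors never appear in similarity buckets
        sim = cosine(target, emb)
        if sim >= RESTATED_MIN:
            restated.append((name, sim))  # assumption: restatements are not also "related"
        elif sim >= RELATED_MIN:
            related.append((name, sim))
    # Highest similarity first within each bucket.
    restated.sort(key=lambda pair: -pair[1])
    related.sort(key=lambda pair: -pair[1])
    return restated, related
```

With a target of `[1.0, 0.0]`, a candidate at `[1.0, 1.0]` (cosine about 0.707) lands in the related bucket, while one at `[1.0, 0.1]` (cosine about 0.995) is treated as a restatement.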