finding
active
finding:model-final-answer-is-decodable-from-activations-far-earlier-in-cot-than-cot-monitor-detects-on-mmlu-recall-based-questions-for-both-deepseek-r1-671b-and-gpt-oss-120b
Model final answer is decodable from activations far earlier in CoT than CoT monitor detects on MMLU recall-based questions for both DeepSeek-R1 671B and GPT-OSS 120B
Core empirical result demonstrating early belief formation in easy tasks
Source paper
extracted_from (2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4
Neighborhood — ranked by edge-count
Claims (2)
claim
- Key comparative finding showing activation probes outperform text-level monitors for early answer detection
- The central empirical claim of the paper, supported by activation probing evidence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection
- Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
- Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
- Shows smaller models are more sensitive to reflection reduction on non-math tasks
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Generalization evidence that truth probes are not invariant to model instructions.
- Quantitative efficiency result on hard benchmark, smaller reduction reflecting genuine reasoning need
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.