finding

active

finding:activation-probing-detects-final-answer-belief-earlier-in-cot-than-cot-monitor-on-both-models-with-especially-pronounced-gap-on-easy-mmlu-questions

Activation probing detects final answer belief earlier in CoT than CoT monitor on both models, with especially pronounced gap on easy MMLU questions

Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection

Source paper

extracted_from

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

(2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4

Neighborhood — ranked by edge-count

Claims (1)

claim

A model's final answer is decodable from activations far earlier in CoT than a CoT monitor can detect, especially for easy recall-based MMLU questions
supports
Key comparative finding showing activation probes outperform text-level monitors for early answer detection

Methods (1)

method

CoT Monitor
cites
Named method for monitoring chain-of-thought text to detect when the model signals its answer, compared against activation probes

Concepts (1)

concept

Activation Probing
cites
Technique of reading out model beliefs from internal activations before the final answer token is generated
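
As a rough illustration of the concept above, the following Python sketch applies a linear softmax probe to per-step hidden states and reports the earliest reasoning step from which the probe stays confident in the final answer. The probe weights `W` and `b`, the toy hidden states, the 0.9 threshold, and both function names are assumptions made for this example; this page does not specify the paper's probe architecture or its stability criterion.

```python
import numpy as np

def probe_predict(hidden, W, b):
    """Apply a linear softmax probe (hypothetical weights W, b).

    hidden: (steps, d_model) activations, one row per CoT step.
    Returns (steps, n_choices) probabilities over answer choices.
    """
    logits = np.asarray(hidden, dtype=float) @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def earliest_stable_step(probs, final_answer, threshold=0.9):
    """Earliest step from which the probe gives >= threshold to final_answer
    at that step and every later step. Returns None if the last step is not
    confident. The threshold is an assumed value, not one from the paper."""
    confident = probs[:, final_answer] >= threshold
    earliest = None
    for t in range(len(confident) - 1, -1, -1):  # walk backwards from the end
        if not confident[t]:
            break
        earliest = t
    return earliest

# Toy data: 4 CoT steps, 2-dimensional activations, 2 answer choices.
hidden = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 1.0], [0.0, 1.0]])
W = 5.0 * np.eye(2)
b = np.zeros(2)
probs = probe_predict(hidden, W, b)
print(earliest_stable_step(probs, final_answer=1))  # prints 2
```

Comparing this step index with the step at which a text-level monitor first flags the answer would give the kind of gap the finding describes; how the monitor step is obtained is outside this sketch.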

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Model final answer is decodable from activations far earlier in CoT than CoT monitor detects on MMLU recall-based questions for both DeepSeek-R1 671B and GPT-OSS 120B · finding · 0.848
Core empirical result demonstrating early belief formation in easy tasks
On GPQA-Diamond multihop questions, activation probes show genuine belief shifts during CoT generation rather than early stabilization, contrasting with MMLU · finding · 0.830
Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
Can activation probing enable efficient adaptive computation by detecting when a model's belief has stabilized during CoT generation? · question · 0.820
Practical question addressed by the probe-guided early exit experiments
No-prompt probes show significant AUROC performance drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt. · finding · 0.781
Generalization evidence that truth probes are not invariant to model instructions.
Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, difficult multihop questions show genuine reasoning · claim · 0.769
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment. · claim · 0.756
Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent. · finding · 0.756
Key improvement in cross-task generalization enabled by explicit instruction framing.
Direct probes over learned activations in standard basis may fail to reveal the actual causal role of representations because they are highly distributed · claim · 0.755
Supported by the finding that non-trivial rotations are required to find aligned representations.
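
The "cosine ≥ 0.65 · no typed edge" rule for this list can be sketched as a simple filter. The embedding vectors, entity names, and the function `related_by_similarity` below are hypothetical; this page does not say how entity embeddings are computed or stored, so this is only one plausible reading of the rule.

```python
import numpy as np

def related_by_similarity(query_vec, candidates, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs for candidates whose cosine similarity to
    query_vec is >= threshold and that have no typed edge to the query entity.

    candidates: dict mapping entity name -> embedding vector (assumed format).
    typed_neighbors: set of names already linked by a typed edge.
    """
    q = np.asarray(query_vec, dtype=float)
    results = []
    for name, vec in candidates.items():
        if name in typed_neighbors:
            continue  # already connected, so not a candidate for a new edge
        v = np.asarray(vec, dtype=float)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        if denom == 0.0:
            continue  # cosine similarity is undefined for a zero vector
        score = float(q @ v / denom)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: -pair[1])

# Toy example with made-up 2-dimensional embeddings.
candidates = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [2.0, 0.0]}
related = related_by_similarity([1.0, 0.0], candidates, typed_neighbors={"d"})
print([name for name, _ in related])  # prints ['a', 'b']
```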