finding
active
finding:activation-probing-detects-final-answer-belief-earlier-in-cot-than-cot-monitor-on-both-models-with-especially-pronounced-gap-on-easy-mmlu-questions
Activation probing detects final answer belief earlier in CoT than CoT monitor on both models, with especially pronounced gap on easy MMLU questions
Comparative finding establishing activation probing as superior to text-level monitoring for early belief detection
Source paper
extracted_from · (2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key comparative finding showing activation probes outperform text-level monitors for early answer detection
Methods (1)
method
- CoT Monitor · cites · Named method for monitoring chain-of-thought text to detect when the model signals its answer, compared against activation probes
Concepts (1)
concept
- Activation Probing · cites · Technique of reading out model beliefs from internal activations before the final answer token is generated
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core empirical result demonstrating early belief formation in easy tasks
- Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
- Practical question addressed by the probe-guided early exit experiments
- Generalization evidence that truth probes are not invariant to model instructions.
- Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
- Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Supported by the finding that non-trivial rotations are required to find aligned representations.