finding
active
finding:activation-probing-detects-final-answer-belief-earlier-in-cot-than-cot-monitor-on-both-models-with-especially-pronounced-gap-on-easy-mmlu-questions

Activation probing detects the model's final-answer belief earlier in the CoT than a CoT monitor does on both models, with an especially pronounced gap on easy MMLU questions

Comparative finding: activation probing detects the model's final-answer belief earlier in the chain of thought than text-level CoT monitoring does

Source paper

extracted_from
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
(2026) · Siddharth Boppana · Annabel Ma · Max Loeffler · Raphaël Sarfati +4

Neighborhood — ranked by edge-count

Claims (1)

claim

Methods (1)

method
  • Named method for monitoring chain-of-thought text to detect when the model signals its answer, compared against activation probes

Concepts (1)

concept
  • Technique of reading out model beliefs from internal activations before the final answer token is generated
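
The comparison between the two readouts above can be illustrated with a minimal sketch. It assumes each approach yields one predicted answer per CoT step, and it takes "detection" to mean the earliest step from which the prediction matches the final answer and never changes again. That stability criterion, the function name first_stable_step, and the sample predictions are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def first_stable_step(predictions: Sequence[str], final_answer: str) -> Optional[int]:
    """Return the earliest step from which every prediction equals final_answer.

    Returns None if the last prediction does not match the final answer.
    (Assumed criterion for illustration; the paper may define detection differently.)
    """
    earliest = None
    # Walk backwards from the last step while predictions still match.
    for step in range(len(predictions) - 1, -1, -1):
        if predictions[step] != final_answer:
            break
        earliest = step
    return earliest


# Hypothetical per-step readouts for one multiple-choice question
# (illustrative data only, not results from the paper).
probe_preds = ["C", "B", "B", "B", "B", "B"]    # answer decoded from activations
monitor_preds = ["?", "?", "?", "C", "B", "B"]  # answer inferred from CoT text
final = "B"

probe_step = first_stable_step(probe_preds, final)
monitor_step = first_stable_step(monitor_preds, final)
lead = monitor_step - probe_step  # number of steps by which the probe is earlier
print(probe_step, monitor_step, lead)
```

Averaging such a lead over many questions, grouped by difficulty, is one way a gap like the one described in this finding could be summarized.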
