dataset

archived

dataset:gpqa-diamond

GPQA-Diamond

Benchmark used to evaluate performative reasoning; models show less performative reasoning on it than on MMLU, consistent with it being the harder task.

Neighborhood — ranked by edge-count

Papers (1)

paper

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
cites, mentions

Findings (2)

finding

Probe-guided early exit reduces tokens by up to 30% on GPQA-Diamond with similar accuracy on DeepSeek-R1 671B and GPT-OSS 120B
cites
Quantitative efficiency result on a hard benchmark; the smaller token reduction reflects a genuine need for reasoning
On GPQA-Diamond multihop questions, activation probes show genuine belief shifts during CoT generation rather than early stabilization, contrasting with MMLU
cites
Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
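
The probe-guided early exit finding above can be illustrated with a minimal sketch. It assumes a probe confidence score is already available at each reasoning step. The threshold, the patience rule, and the function names are hypothetical choices for illustration, not the paper's method, and the values in the usage lines are toy inputs, not reported results.

```python
from typing import Sequence


def early_exit_step(
    confidences: Sequence[float],
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
    patience: int = 2,       # assumed number of consecutive confident steps
) -> int:
    """Return how many reasoning steps are consumed before exiting.

    Exits once the probe confidence has stayed at or above `threshold`
    for `patience` consecutive steps; otherwise runs to the end.
    """
    streak = 0
    for i, conf in enumerate(confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return i + 1
    return len(confidences)


def token_savings(tokens_per_step: Sequence[int], steps_used: int) -> float:
    """Fraction of tokens saved by stopping after `steps_used` steps."""
    total = sum(tokens_per_step)
    if total == 0:
        return 0.0
    used = sum(tokens_per_step[:steps_used])
    return 1.0 - used / total


# Toy inputs: the probe becomes confident late in the chain of thought.
toy_confidences = [0.4, 0.55, 0.7, 0.92, 0.95, 0.97, 0.98]
toy_tokens = [100] * len(toy_confidences)

steps = early_exit_step(toy_confidences)
savings = token_savings(toy_tokens, steps)
```

Under this sketch, a question whose probe confidence stabilizes late leaves few steps to skip, which is one way to read the smaller token reduction on harder questions.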

Claims (1)

claim

Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, while difficult multihop questions show genuine reasoning
cites
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper