dataset
archived
dataset:gpqa-diamond
GPQA-Diamond
Benchmark used to evaluate performative reasoning. Models show less performative reasoning on it than on MMLU, consistent with GPQA-Diamond being the harder task.
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (2)
finding
- Quantitative efficiency result on a hard benchmark: the smaller reduction reflects a genuine need for reasoning
- Empirical finding contrasting difficult questions with easy ones, supporting the view that reasoning on hard tasks is genuine
Claims (1)
claim
- Task difficulty is the key variable distinguishing the two modes of chain-of-thought (CoT) identified in the paper