dataset
archived
dataset:gpqa-diamond

GPQA-Diamond

Benchmark used to evaluate performative reasoning. Models show less performative reasoning on GPQA-Diamond than on MMLU, consistent with GPQA-Diamond being the harder task.

Neighborhood — ranked by edge-count

Findings (2)

finding

Claims (1)

claim