dataset

archived

dataset:mmlu

MMLU

Benchmark used to evaluate performative reasoning. As the easier task, it shows significantly more performative reasoning than GPQA-Diamond.

Neighborhood — ranked by edge-count

Papers (2)

paper

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
cites, mentions
ReflCtrl: Controlling LLM Reflection via Representation Engineering
mentions

Methods (1)

method

E3: Layer-wise Geometric Trajectory Analysis
uses
Quantitative study correlating layer-wise anchoring geometry (S_max, AUS_N) with behavioral thresholds θ50

Findings (2)

finding

Probe-guided early exit reduces tokens by up to 30% on GPQA-Diamond with similar accuracy on DeepSeek-R1 671B and GPT-OSS 120B
cites
Quantitative efficiency result on a hard benchmark; the smaller reduction reflects a genuine need for reasoning
On MMLU recall-based questions, the model's final answer is decodable from activations far earlier in the CoT than a CoT monitor detects it, for both DeepSeek-R1 671B and GPT-OSS 120B
cites
Core empirical result demonstrating early belief formation on easy tasks

Claims (1)

claim

Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, while difficult multi-hop questions show genuine reasoning
cites
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper