hypothesis
active
We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks.
Connects the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.
Source paper
extracted_from (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Papers (1)
paper
- Testing the Limits of Truth Directions in LLMs · introduces · supports
Findings (1)
finding
- Shows rapid generalization decay for arithmetic truth directions with each additional operation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Identifies key limitations of latent methods.
- Selective pressure toward convergence via task generality
- Central thesis about the role of agency in evolutionary dynamics.
- Argues current evaluation approaches are fundamentally misleading about model capabilities
- Pinpoints list-length 3 as the exact boundary where genuine counting introduces the limitation.
- Empirical finding contrasting difficult questions with easy ones, supporting genuine reasoning on hard tasks
- Case study showing MAS can compare specific causal information types across models trained on different tasks.