quote
active
quote:harness-updating-is-flat-in-base-capability-models-across-capability-tiers-produce-updates-that-yield-similar-gains-and-even-the-qwen3-5-9b-evolver-induces-gains-comparable-to-claude-opus-4-6
harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6
Verbatim summary of the first major finding, taken from the paper's conclusion
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- First major claim of the paper, supported by narrow spread across evolvers and case study
- Case demonstrating that model scale does not predict harness-updating quality
- Primary design recommendation derived from harness-updating flatness finding
- Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark · finding · 0.826 · Core finding that harness-updating capability does not scale with model base capability
- The capability of an evolver model to produce useful persistent harness updates from execution evidence
- Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
- Motivating claim for the paper's controlled analysis approach
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.798 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates