finding
active
finding:harness-updating-gain-spread-is-at-most-3-1-percentage-points-across-all-evolvers-on-any-single-benchmark
Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark
Core finding that harness-updating capability does not scale with model base capability
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (2)
claim
- First major claim of the paper, supported by narrow spread across evolvers and case study
- Primary design recommendation derived from harness-updating flatness finding
Questions (1)
question
- Which models produce useful harness updates? · answered_by · First open question the paper sets out to answer through evolver-side analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Verbatim summary of first major finding from conclusion
- Case demonstrating that model scale does not predict harness-updating quality
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.778 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.776 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
- Metric measuring harness-updating capability as the mean pairwise gain across an anchor agent set
- Replication of non-monotonic harness-benefit pattern on a second benchmark
- Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
- Motivating claim for the paper's controlled analysis approach
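
The metric entry above describes harness-updating capability as a mean pairwise gain across an anchor agent set, and the finding title bounds the spread of that gain across evolvers. A minimal Python sketch of both quantities follows. It assumes the gain for one evolver-agent pair is the post-evolution score minus the base score, and that the spread is the maximum minus the minimum gain on one benchmark; neither definition is confirmed in detail by this page. The function names and the anchor-agent scores are hypothetical. The two SWE-bench gains are the only per-evolver values quoted here (the stated highest and lowest of seven evolvers).

```python
from statistics import mean


def harness_updating_gain(base_scores, evolved_scores):
    """Mean gain (pp) for one evolver, averaged over the anchor agents.

    Assumption: gain per pair = post-evolution score - base score.
    """
    return mean(evolved_scores[agent] - base_scores[agent] for agent in base_scores)


def gain_spread(gain_by_evolver):
    """Spread (pp) across evolvers on one benchmark, taken as max - min."""
    gains = list(gain_by_evolver.values())
    return max(gains) - min(gains)


# Hypothetical anchor-agent scores (percent), for illustration only.
base = {"agent_a": 40.0, "agent_b": 55.0, "agent_c": 70.0}
evolved = {"agent_a": 46.0, "agent_b": 63.0, "agent_c": 74.0}
example_gain = harness_updating_gain(base, evolved)

# Highest and lowest SWE-bench gains quoted on this page; the other five
# evolvers lie between them, so the extremes alone determine the spread.
swe_bench_extremes = {"Qwen3-235B": 8.2, "GPT-OSS-120B": 5.9}
swe_bench_spread = round(gain_spread(swe_bench_extremes), 1)

print(example_gain, swe_bench_spread)
```

With the two quoted SWE-bench values the spread comes to 2.3 pp, which is consistent with the 3.1 pp bound stated in the finding title.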