finding
active
finding:qwen3-235b-leads-as-evolver-on-swe-bench-with-8-2-pp-harness-updating-gain-but-ranks-last-on-mcp-with-0-6-pp
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp
Illustrates benchmark-dependent reshuffling of evolver rankings: no evolver dominates across all substrates.
Source paper
extracted_from (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- First major claim of the paper, supported by narrow spread across evolvers and case study
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Case demonstrating that model scale does not predict harness-updating quality
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.827 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
- Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
- Full evolver-side SWE results showing comparable performance across Claude family tiers
- Verbatim summary of first major finding from conclusion
- Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
- Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark · finding · 0.776 · Core finding that harness-updating capability does not scale with model base capability