finding

active

finding:qwen3-235b-leads-as-evolver-on-swe-bench-with-8-2-pp-harness-updating-gain-but-ranks-last-on-mcp-with-0-6-pp

Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp

Illustrates benchmark-dependent reshuffling of evolver rankings: no evolver dominates across all substrates

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Harness-updating capability is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains
supports
First major claim of the paper, supported by narrow spread across evolvers and case study

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.875
Case demonstrating that model scale does not predict harness-updating quality
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.842
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.827
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.813
Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp · finding · 0.803
Full evolver-side SWE results showing comparable performance across Claude family tiers
harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6 · quote · 0.798
Verbatim summary of first major finding from conclusion
Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities · finding · 0.789
Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark · finding · 0.776
Core finding that harness-updating capability does not scale with model base capability
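
As a rough check on how the entries above fit together, the SWE-bench spread can be recomputed from the harness-updating gains listed on this page. This is a minimal sketch, not the paper's evaluation code: only five of the seven evolvers appear here, but because the listed values include both the leader (Qwen3-235B) and the lowest (GPT-OSS-120B), the maximum-minus-minimum spread should not depend on the two missing ones. The names `swe_gains_pp` and `spread_pp` are illustrative assumptions.

```python
# Harness-updating gains on SWE-bench in percentage points (pp),
# copied from the findings listed on this page.
# Assumption: the two evolvers not listed here fall between the
# stated leader (8.2 pp) and the stated lowest value (5.9 pp).
swe_gains_pp = {
    "Qwen3-235B": 8.2,         # stated leader on SWE-bench
    "Claude Haiku 4.5": 8.0,
    "Claude Opus 4.6": 7.4,
    "Claude Sonnet 4.6": 7.4,
    "GPT-OSS-120B": 5.9,       # stated lowest among the seven evolvers
}


def spread_pp(gains):
    """Return the max-minus-min spread of gains, rounded to 0.1 pp."""
    return round(max(gains.values()) - min(gains.values()), 1)


swe_spread = spread_pp(swe_gains_pp)
print(f"SWE-bench spread: {swe_spread} pp")
```

Under these values the SWE-bench spread comes out below the 3.1 pp upper bound reported for any single benchmark, which is consistent with the "flat in base capability" claim this finding supports.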