finding

active

finding:harness-updating-gain-spread-is-at-most-3-1-percentage-points-across-all-evolvers-on-any-single-benchmark

Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark

Core finding that harness-updating capability does not scale with model base capability
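
As a rough illustration of what "spread" means here, the sketch below takes the highest and lowest per-evolver gains on one benchmark and subtracts them. It uses only the two SWE-bench extremes recorded among this entry's related findings (8.2 pp for Qwen3-235B, 5.9 pp for GPT-OSS-120B, the lowest of seven evolvers). The helper name gain_spread is hypothetical, and the gains of the other five evolvers are not listed on this page, so they are omitted; they would lie between the two extremes and would not change the result.

```python
def gain_spread(gains_pp):
    """Return max minus min of per-evolver harness-updating gains (in pp)."""
    values = list(gains_pp.values())
    return max(values) - min(values)


# Only the two SWE-bench extremes reported on this page; the remaining
# five evolvers fall between them, so the spread is unchanged.
swe_bench_extremes = {"Qwen3-235B": 8.2, "GPT-OSS-120B": 5.9}

spread = gain_spread(swe_bench_extremes)
print(f"SWE-bench spread: {spread:.1f} pp")
```

Under these numbers the SWE-bench spread comes out at roughly 2.3 pp, which is consistent with the 3.1 pp upper bound stated in the finding.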

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (2)

claim

Harness-updating capability is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains
supports
First major claim of the paper, supported by narrow spread across evolvers and case study
Capability budget should be allocated to the task-solving agent rather than the evolver, since harness-updating varies by at most 3.1 pp across evolvers
supports
Primary design recommendation derived from harness-updating flatness finding

Questions (1)

question

Which models produce useful harness updates?
answered_by
First open question the paper sets out to answer through evolver-side analysis

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6 · quote · 0.826
Verbatim summary of first major finding from conclusion
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.783
Case demonstrating that model scale does not predict harness-updating quality
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.778
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.776
Illustrates benchmark-dependent reshuffling of evolver rankings, no evolver dominates across all substrates
Harness-Updating Gain (Δupdate) · method · 0.765
Metric measuring harness-updating capability as the mean pairwise gain across an anchor agent set
On MCP-Atlas, harness-benefit peaks at GPT-OSS-120B (7.0 pp), with lower gains at both ends of the base-capability scale · finding · 0.738
Replication of non-monotonic harness-benefit pattern on a second benchmark
Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities · finding · 0.737
Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
End-to-end evaluation scores conflate three sources of improvement: base capability, harness-updating quality, and harness-benefit, leaving it unclear which models produce useful updates or benefit most from them · claim · 0.729
Motivating claim for the paper's controlled analysis approach
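
The Harness-Updating Gain (Δupdate) entry above describes the metric as the mean pairwise gain across an anchor agent set. One plausible reading, sketched below, is to take each anchor agent's score with the evolver's updated harness, subtract its score with the unmodified harness, and average those differences. This reading is an assumption and is not confirmed by this page; the function name, the agent names, and all scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def harness_updating_gain(base_scores, evolved_scores):
    """Mean over anchor agents of (evolved score - base score), in pp.

    Assumed reading of "mean pairwise gain"; not taken from the paper.
    """
    if base_scores.keys() != evolved_scores.keys():
        raise ValueError("both score tables must cover the same anchor agents")
    return mean(evolved_scores[a] - base_scores[a] for a in base_scores)


# Hypothetical scores (percent) for three placeholder anchor agents.
base = {"agent_a": 40.0, "agent_b": 55.0, "agent_c": 70.0}
evolved = {"agent_a": 46.0, "agent_b": 58.0, "agent_c": 73.0}

delta_update = harness_updating_gain(base, evolved)
print(f"Delta_update: {delta_update:.1f} pp")
```

Averaging over a fixed anchor set is what would let gains from different evolvers be compared on equal footing, since every evolver's harness is then scored by the same agents.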