which models produce useful harness updates?

First open question the paper sets out to answer through evolver-side analysis

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Findings (1)

finding

Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark
answered_by
Core finding that harness-updating capability does not scale with model base capability

Claims (1)

claim

End-to-end evaluation scores conflate three sources of improvement: base capability, harness-updating quality, and harness-benefit, leaving it unclear which models produce useful updates or benefit most from them
gates
Motivating claim for the paper's controlled analysis approach

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

which models actually benefit from updated harnesses? · question · 0.895
Second open question the paper sets out to answer through agent-side analysis
Harness-Updating Capability · concept · 0.814
The capability of an evolver model to produce useful persistent harness updates from execution evidence
Harness-updating capability is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains · claim · 0.801
First major claim of the paper, supported by narrow spread across evolvers and case study
harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6 · quote · 0.795
Verbatim summary of first major finding from conclusion
Harness-Updating Gain (Δupdate) · method · 0.752
Metric measuring harness-updating capability as the mean pairwise gain across an anchor agent set
Harness-benefit is non-monotonic in base capability: weak-tier models benefit little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier · claim · 0.743
Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
what explains why weak-tier models with the most performance headroom benefit least from harness evolution? · question · 0.742
In-depth diagnostic question addressed by the analysis of two failure modes
Harness-Benefit Capability · concept · 0.732
The capability of a task-solving agent to benefit from updated harnesses during task solving
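
The Harness-Updating Gain (Δupdate) entry describes the metric as a mean pairwise gain across an anchor agent set, and the finding on this page reports the spread of that gain across evolvers. A minimal Python sketch of both quantities follows. It rests on assumptions this page does not confirm: each pairwise gain is taken as an anchor agent's score with the evolver's updated harness minus its score with the baseline harness, and spread is the maximum minus the minimum gain across evolvers. The function names, agent and evolver names, and all scores are hypothetical and are not results from the paper.

```python
from statistics import mean


def harness_updating_gain(baseline, updated):
    """Mean gain over anchor agents (assumed definition, not the paper's).

    baseline: anchor agent -> score with the baseline harness
    updated:  anchor agent -> score with one evolver's updated harness
    """
    if baseline.keys() != updated.keys():
        raise ValueError("baseline and updated must cover the same anchor agents")
    return mean([updated[agent] - baseline[agent] for agent in baseline])


def gain_spread(gains):
    """Largest minus smallest gain across evolvers, in percentage points."""
    return max(gains.values()) - min(gains.values())


# Hypothetical scores in percent for three anchor agents on one benchmark.
baseline = {"agent_a": 40.0, "agent_b": 55.0, "agent_c": 62.0}
updated_by_evolver = {
    "evolver_small": {"agent_a": 44.0, "agent_b": 61.0, "agent_c": 64.0},
    "evolver_large": {"agent_a": 45.0, "agent_b": 62.0, "agent_c": 65.0},
}

gains = {
    name: harness_updating_gain(baseline, updated)
    for name, updated in updated_by_evolver.items()
}
spread = gain_spread(gains)
print(gains, spread)
```

Under these assumptions, holding the anchor agent set fixed means that differences in the gain reflect the evolver rather than the task-solving agent, which is the separation the motivating claim asks for.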