finding

active

finding:qwen3-5-9b-evolver-achieves-highest-harness-updating-gain-on-skillsbench-3-8-pp-exceeding-claude-opus-4-6-2-3-pp-and-qwen3-235b-1-5-pp

Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp)

Case demonstrating that model scale does not predict harness-updating quality

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Harness-updating capability is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains
supports
First major claim of the paper, supported by narrow spread across evolvers and case study

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.875
Illustrates benchmark-dependent reshuffling of evolver rankings: no evolver dominates across all substrates
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.862
Shows that the SkillsBench (SB) low-base regime is variable; similar starting points can yield very different harness-benefit
harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6 · quote · 0.851
Verbatim summary of first major finding from conclusion
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.838
Case study demonstrating the mechanism behind flat harness-updating: smaller models reach the same procedural content
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.819
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp · finding · 0.814
Full evolver-side SWE results showing comparable performance across Claude family tiers
Haiku 4.5 achieves the largest harness-benefit on SkillsBench (15.1 pp) despite mid-tier base capability of 5.8% · finding · 0.803
Shows that the SkillsBench (SB) low-base regime is more variable than SWE-bench; Haiku benefits far more than Qwen3-235B despite similar base rates
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.803
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models