finding

active

finding:on-swe-bench-claude-opus-4-6-and-claude-sonnet-4-6-both-achieve-7-4-pp-harness-updating-gain-claude-haiku-4-5-achieves-8-0-pp

On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp

Full evolver-side SWE results showing comparable performance across Claude family tiers

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
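
As a rough illustration of the selection rule stated above (cosine ≥ 0.65 and no typed edge), the following minimal sketch filters candidate neighbors. The function names, the in-memory `embeddings` dictionary, and the representation of typed edges as unordered ID pairs are all assumptions made for this example; the page does not describe how the system computes or stores these.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_candidates(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge.

    embeddings  -- hypothetical mapping of entity ID to embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs
    threshold   -- 0.65 matches the cutoff shown on this page
    """
    target = embeddings[target_id]
    results = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first, as in the list on this page.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks that the note above describes.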

On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.834
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.814
Case demonstrating that model scale does not predict harness-updating quality
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961 · finding · 0.806
Quantifies harness activation failure for weak-tier models vs. strong-tier models
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.803
Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
Haiku 4.5 achieves the largest harness-benefit on SkillsBench (15.1 pp) despite mid-tier base capability of 5.8% · finding · 0.802
Shows SB low-base regime is more variable than SWE; Haiku benefits far more than Qwen3-235B despite similar base rates
harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6 · quote · 0.796
Verbatim summary of first major finding from conclusion
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.786
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
Sonnet 4.5 TrueSkill μ=26.4 ± 4.9 (n=14, 35.7% win rate) · finding · 0.784
Mid-field performance with larger uncertainty due to small sample.