finding

active

finding:haiku-4-5-achieves-the-largest-harness-benefit-on-skillsbench-15-1-pp-despite-mid-tier-base-capability-of-5-8

Haiku 4.5 achieves the largest harness-benefit on SkillsBench (15.1 pp) despite mid-tier base capability of 5.8%

Shows that the SkillsBench (SB) low-base regime is more variable than the SWE-bench one: Haiku 4.5 (5.8% base, 15.1 pp) benefits far more than Qwen3-235B (4.7% base, 1.1 pp) despite similar base rates
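The comparison behind this finding can be checked with a few lines of arithmetic. The sketch below uses only the SkillsBench figures quoted on this page; the dictionary layout and function names are illustrative, and it assumes that harness-benefit is the with-harness pass rate minus the base pass rate, in percentage points, which this page does not state explicitly.

```python
# SkillsBench figures quoted on this page; the dict layout is illustrative.
RESULTS = {
    "Haiku 4.5": {"base_pct": 5.8, "benefit_pp": 15.1},
    "Qwen3-235B": {"base_pct": 4.7, "benefit_pp": 1.1},
}


def implied_pass_rate(base_pct: float, benefit_pp: float) -> float:
    """Pass rate with the harness, ASSUMING benefit = with-harness minus base."""
    return round(base_pct + benefit_pp, 1)


def gap(model_a: str, model_b: str, key: str) -> float:
    """Absolute difference between two models on one field, in points."""
    return round(abs(RESULTS[model_a][key] - RESULTS[model_b][key]), 1)


if __name__ == "__main__":
    for name, row in RESULTS.items():
        print(name, implied_pass_rate(row["base_pct"], row["benefit_pp"]))
    print("base gap:", gap("Haiku 4.5", "Qwen3-235B", "base_pct"))
    print("benefit gap:", gap("Haiku 4.5", "Qwen3-235B", "benefit_pp"))
```

The two models start about one point apart in base pass rate but end fourteen points apart in harness-benefit, which is the variability the finding describes.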

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.811
Shows that the SkillsBench low-base regime is variable; similar starting points can yield very different harness-benefit
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.803
Case demonstrating that model scale does not predict harness-updating quality
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp · finding · 0.802
Full evolver-side SWE results showing comparable performance across Claude family tiers
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.777
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Harness-benefit is non-monotonic in base capability: weak-tier models benefit little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier · claim · 0.766
Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6 · quote · 0.760
Verbatim summary of first major finding from conclusion
Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.754
Quantifies harness adherence failure gap between strong and weak tier models
On MCP-Atlas, harness-benefit peaks at GPT-OSS-120B (7.0 pp), with lower gains at both ends of the base-capability scale · finding · 0.752
Replication of non-monotonic harness-benefit pattern on a second benchmark
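
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the system's actual implementation: the function names, the embedding dictionary, and the representation of typed edges as directed `(source, target)` pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_candidates(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that have no typed edge from the target.

    `embeddings` maps entity id -> vector and `typed_edges` is a set of
    (source, target) pairs; both are hypothetical structures for this sketch.
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id or (target_id, other_id) in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-D vectors, made up for illustration only.
    toy = {"t": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(related_candidates("t", toy, typed_edges={("t", "a")}))
```

In the toy data, `a` is excluded because it already has a typed edge and `b` falls below the threshold, so only `c` is returned as a candidate.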