finding

active

finding:qwen3-235b-achieves-only-1-1-pp-harness-benefit-on-skillsbench-despite-4-7-base-pass-rate-near-qwen3-32b-s-0-0-baseline

Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline

Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp)finding0.862
Case demonstrating that model scale does not predict harness-updating quality
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 ppfinding0.856
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 ppfinding0.813
Illustrates benchmark-dependent reshuffling of evolver rankings, no evolver dominates across all substrates
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961finding0.811
Quantifies harness activation failure for weak-tier models vs. strong-tier models
Haiku 4.5 achieves the largest harness-benefit on SkillsBench (15.1 pp) despite mid-tier base capability of 5.8%finding0.811
Shows SB low-base regime is more variable than SWE; Haiku benefits far more than Qwen3-235B despite similar base rates
Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBenchfinding0.794
Quantifies harness adherence failure gap between strong and weak tier models
Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39)finding0.779
Demonstrates long-horizon instruction-following bottleneck for weak-tier models
Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5xfinding0.757
Parameters don't predict scores; 135x more parameters yields 60% lower score