finding
active
finding:qwen3-5-9b-evolver-achieves-highest-harness-updating-gain-on-skillsbench-3-8-pp-exceeding-claude-opus-4-6-2-3-pp-and-qwen3-235b-1-5-pp
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp)
Case demonstrating that model scale does not predict harness-updating quality
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- First major claim of the paper, supported by narrow spread across evolvers and case study
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.875 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
- Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
- Verbatim summary of first major finding from conclusion
- Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Full evolver-side SWE results showing comparable performance across Claude family tiers
- Shows SB low-base regime is more variable than SWE; Haiku benefits far more than Qwen3-235B despite similar base rates
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.803 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models