finding
active
finding:on-swe-bench-claude-opus-4-6-and-claude-sonnet-4-6-both-achieve-7-4-pp-harness-updating-gain-claude-haiku-4-5-achieves-8-0-pp
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp
Full evolver-side SWE results showing comparable performance across Claude family tiers
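As a rough illustration of what "comparable" means here, the sketch below computes the spread of the SWE-bench harness-updating gains quoted on this page. It uses only the figures stated in this finding and in the related findings; the variable names are hypothetical, and only five of the seven evolvers appear on this page, so the dictionary is not the full matrix.

```python
# Harness-updating gains on SWE-bench in percentage points (pp),
# copied from the findings quoted on this page.
# Names below are illustrative and do not come from the source paper.
swe_gain_pp = {
    "Claude Opus 4.6": 7.4,
    "Claude Sonnet 4.6": 7.4,
    "Claude Haiku 4.5": 8.0,
    "Qwen3-235B": 8.2,    # stated as the leading evolver on SWE-bench
    "GPT-OSS-120B": 5.9,  # stated as the lowest of all seven evolvers
}

# Restrict to the Claude family tiers.
claude = {name: gain for name, gain in swe_gain_pp.items() if name.startswith("Claude")}

# Spread = max gain minus min gain, rounded to one decimal to avoid float noise.
claude_spread = round(max(claude.values()) - min(claude.values()), 1)
overall_spread = round(max(swe_gain_pp.values()) - min(swe_gain_pp.values()), 1)

print(f"Claude tier spread: {claude_spread} pp")
print(f"Spread between leading and lowest evolver: {overall_spread} pp")
```

Because the page states the leading and lowest evolvers explicitly, the second figure is the full range across all seven evolvers even though two of them are not listed here.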
Source paper
extracted_from: (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Case demonstrating that model scale does not predict harness-updating quality
- Quantifies harness activation failure for weak-tier models vs. strong-tier models
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.803 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
- Shows SB low-base regime is more variable than SWE; Haiku benefits far more than Qwen3-235B despite similar base rates
- Verbatim summary of first major finding from conclusion
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.786 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
- Mid-field performance with larger uncertainty due to small sample.