finding
active
finding:on-swe-bench-harness-benefit-peaks-at-qwen3-235b-19-3-pp-while-weaker-qwen3-32b-gains-only-4-4-pp-and-stronger-opus-4-6-gains-only-2-6-pp
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp
Core finding demonstrating a non-monotonic relationship between base capability and harness-benefit
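As a quick illustration of what "non-monotonic" means here, the sketch below checks the three SWE-bench gains quoted in the title, ordered from weakest to strongest base model as the finding describes them. The variable and function names (`gains_pp`, `is_monotonic`) are hypothetical and chosen for this example only. It is not the paper's analysis code.

```python
# Harness-benefit in percentage points on SWE-bench, taken from the finding's title.
# Order (weakest to strongest base model) follows the finding's own wording.
gains_pp = [("Qwen3-32B", 4.4), ("Qwen3-235B", 19.3), ("Opus 4.6", 2.6)]


def is_monotonic(values):
    """Return True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


values = [gain for _, gain in gains_pp]

# The model with the largest harness-benefit.
peak_model, peak_gain = max(gains_pp, key=lambda item: item[1])

print(is_monotonic(values))   # False: the gain rises, then falls
print(peak_model, peak_gain)  # Qwen3-235B 19.3
```

The peak sits at the middle model rather than at either end, which is what the finding means by a non-monotonic relationship.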
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
Questions (1)
question
- Second open question the paper sets out to answer through agent-side analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows that the SB low-base regime is variable; similar starting points can yield very different harness-benefit
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.842 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
- Replication of non-monotonic harness-benefit pattern on a second benchmark
- Full evolver-side SWE results showing comparable performance across Claude family tiers
- Case demonstrating that model scale does not predict harness-updating quality
- Quantifies harness activation failure for weak-tier models vs. strong-tier models
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.809 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
- Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.805 · Quantifies harness adherence failure gap between strong and weak tier models