finding
active
finding:gpt-oss-120b-achieves-5-9-pp-harness-updating-gain-on-swe-bench-lowest-among-all-seven-evolvers
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.840 · Mid-tier model shows moderate adherence drift compared to weak and strong tiers
- Replication of non-monotonic harness-benefit pattern on a second benchmark
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.827 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
- Mid-tier model showing intermediate activation rate between weak and strong tiers
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Case demonstrating that model scale does not predict harness-updating quality
- Verbatim summary of first major finding from conclusion
- Full evolver-side SWE results showing comparable performance across Claude family tiers
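
The "cosine ≥ 0.65 · no typed edge" rule behind this list can be sketched as a small filter. This is a minimal illustration only, not the system's actual implementation: the function names, the dictionary of embeddings, and the set of typed-edge pairs are all hypothetical, and only the 0.65 threshold comes from the text above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Entities returned by such a filter would still need manual review, since a high score alone does not say whether the pair is a duplicate or merits a new typed edge.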