finding
active
finding:haiku-4-5-achieves-the-largest-harness-benefit-on-skillsbench-15-1-pp-despite-mid-tier-base-capability-of-5-8
Haiku 4.5 achieves the largest harness-benefit on SkillsBench (15.1 pp) despite mid-tier base capability of 5.8%
Shows that the SkillsBench (SB) low-base regime is more variable than SWE; Haiku 4.5 benefits far more than Qwen3-235B despite similar base rates
Source paper
extracted_from(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
- Case demonstrating that model scale does not predict harness-updating quality
- Full evolver-side SWE results showing comparable performance across Claude family tiers
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
- Verbatim summary of first major finding from conclusion
- Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.754 · Quantifies harness adherence failure gap between strong and weak tier models
- Replication of non-monotonic harness-benefit pattern on a second benchmark