finding
active
finding:qwen3-32b-achieves-a-skill-load-rate-of-0-251-while-opus-4-6-sonnet-4-6-and-qwen3-235b-achieve-slr-of-0-957-0-961
Qwen3-32B achieves a skill-load rate (SLR) of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLRs of 0.957–0.961
Quantifies harness activation failure for weak-tier models vs. strong-tier models
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- Diagnosis of first failure mode explaining low harness-benefit for weak-tier models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.874 · Quantifies harness adherence failure gap between strong and weak tier models
- Demonstrates that harness loading is necessary but not sufficient for harness benefit; cleanest separation of activation and adherence
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
- Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
- Full evolver-side SWE results showing comparable performance across Claude family tiers
- Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5x · finding · 0.805 · Parameters don't predict scores; 135x more parameters yields 60% lower score
- Case demonstrating that model scale does not predict harness-updating quality
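
The ratios quoted in the Qwen 35B vs. Hermes 405B entry in the list above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers stated in that entry. The variable names are illustrative and do not come from the source.

```python
# Values as quoted in the Qwen 35B vs. Hermes 405B entry; names are hypothetical.
qwen_score, hermes_score = 4.38, 1.75
qwen_active_b, hermes_active_b = 3, 405  # active parameters, in billions

# How many times higher Qwen's score is than Hermes's.
score_ratio = qwen_score / hermes_score
# How many times more active parameters Hermes has than Qwen.
param_ratio = hermes_active_b / qwen_active_b
# Fraction by which Hermes's score falls below Qwen's.
relative_drop = 1 - hermes_score / qwen_score

print(f"score ratio: {score_ratio:.2f}x")
print(f"parameter ratio: {param_ratio:.0f}x")
print(f"Hermes score is {relative_drop:.0%} lower")
```

Rounded, these agree with the 2.5x, 135x, and 60% figures in the entry.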