finding

active

finding:qwen3-32b-achieves-a-skill-load-rate-of-0-251-while-opus-4-6-sonnet-4-6-and-qwen3-235b-achieve-slr-of-0-957-0-961

Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961

Quantifies harness activation failure for weak-tier models vs. strong-tier models

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Weak-tier models often fail to invoke relevant harness artifacts during task-solving, with Qwen3-32B showing a 25% load rate against ~96% for strong models
supports
Diagnosis of first failure mode explaining low harness-benefit for weak-tier models

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBenchfinding0.874
Quantifies harness adherence failure gap between strong and weak tier models
Qwen3-235B has SLR of 0.961 (nearly identical to Opus 4.6) yet HFR of only 0.350, with LPR of 0.022 vs. Opus 4.6's 0.177finding0.873
Demonstrates that harness loading is necessary but not sufficient for harness benefit; cleanest separation of activation and adherence
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 ppfinding0.814
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skillfinding0.812
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baselinefinding0.811
Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 ppfinding0.806
Full evolver-side SWE results showing comparable performance across Claude family tiers
Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5xfinding0.805
Parameters don't predict scores; 135x more parameters yields 60% lower score
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp)finding0.801
Case demonstrating that model scale does not predict harness-updating quality