finding

active

finding:opus-4-6-achieves-hfr-of-0-757-while-qwen3-32b-achieves-hfr-of-only-0-142-on-skillsbench

Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench

Quantifies the harness-adherence failure gap between strong-tier and weak-tier models

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Even when the harness is loaded, weak-tier models fail to adhere to it due to weak instruction-following over long-horizon tasks, drifting more than four times more steeply than strong models
supports
Diagnosis of second failure mode explaining low harness-benefit for weak-tier models

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3-235B has SLR of 0.961 (nearly identical to Opus 4.6) yet HFR of only 0.350, with LPR of 0.022 vs. Opus 4.6's 0.177 · finding · 0.881
Demonstrates that harness loading is necessary but not sufficient for harness benefit; cleanest separation of activation and adherence
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961 · finding · 0.874
Quantifies harness activation failure for weak-tier models vs. strong-tier models
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable the Opus 4.6 agent to score 1.0 vs. 0.67 without the skill · finding · 0.811
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.805
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.796
Case demonstrating that model scale does not predict harness-updating quality
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.794
Shows that the SkillsBench low-base regime is variable; similar starting points can yield very different harness-benefit
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp · finding · 0.779
Full evolver-side SWE results showing comparable performance across Claude family tiers
Magnum V4 72B scores 1.76 baseline and lifts +2.58 (to 4.34) under contemplative prompt · finding · 0.765
Full-parameter fine-tuning more destructive to baseline but preserves more latent headroom than LoRA