finding

active

finding:qwen3-235b-has-slr-of-0-961-nearly-identical-to-opus-4-6-yet-hfr-of-only-0-350-with-lpr-of-0-022-vs-opus-4-6-s-0-177

Qwen3-235B has SLR of 0.961 (nearly identical to Opus 4.6) yet HFR of only 0.350, with LPR of 0.022 vs. Opus 4.6's 0.177

Demonstrates that loading the harness is necessary but not sufficient for benefiting from it; this is the cleanest separation of activation (measured by SLR) from adherence (measured by HFR)
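The dissociation can be made concrete with a little arithmetic on the figures quoted in this finding. The sketch below is illustrative only: the variable names and the "SLR minus HFR" gap are hypothetical summaries introduced here, not statistics defined by the source paper, and the metric definitions themselves are not given on this page.

```python
# Figures exactly as quoted in this finding; how SLR, HFR, and LPR are
# computed is not specified here, so they are treated as opaque rates.
qwen3_235b = {"SLR": 0.961, "HFR": 0.350, "LPR": 0.022}
opus_4_6_lpr = 0.177

# Hypothetical summary: how far adherence (HFR) lags activation (SLR)
# for the same model.
load_follow_gap = qwen3_235b["SLR"] - qwen3_235b["HFR"]

# How many times higher Opus 4.6's LPR is than Qwen3-235B's.
lpr_ratio = opus_4_6_lpr / qwen3_235b["LPR"]

print(f"SLR - HFR gap for Qwen3-235B: {load_follow_gap:.3f}")
print(f"LPR ratio (Opus 4.6 / Qwen3-235B): {lpr_ratio:.2f}x")
```

Under these numbers the gap is about 0.61 and Opus 4.6's LPR is roughly eight times higher, which is the sense in which near-ceiling loading coexists with low adherence.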

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (2)

claim

Even when the harness is loaded, weak-tier models fail to adhere to it due to weak instruction-following over long-horizon tasks, drifting more than four times more steeply than strong models
supports
Diagnosis of second failure mode explaining low harness-benefit for weak-tier models

Loading the harness is not sufficient for benefiting from it: a model with near-ceiling SLR can still have low HFR and LPR
supports
Derived from Qwen3-235B's dissociation between SLR (0.961) and HFR (0.350)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.881
Quantifies harness adherence failure gap between strong and weak tier models
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961 · finding · 0.873
Quantifies harness activation failure for weak-tier models vs. strong-tier models
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.796
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5x · finding · 0.795
Parameters don't predict scores; 135x more parameters yields 60% lower score
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.778
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
Qwen-2.5-3B ASR drops from 98.6% at dim 1 to 45.1% at dim 2, recovering partially then declining to 65.3% at dim 5 · finding · 0.768
Smaller models show non-monotonic and diminished ASR with increasing cone dimensionality
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.751
Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
Qwen-2.5-7B achieves 100% ASR across all cone dimensions 1–5 · finding · 0.748
Experiment 2 result showing large models can support high-dimensional truth cones