finding

active

finding:qwen3-32b-adherence-drops-from-0-52-after-harness-loading-to-0-13-at-final-validation-drift-of-0-39

Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39)

Demonstrates long-horizon instruction-following bottleneck for weak-tier models

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Even when the harness is loaded, weak-tier models fail to adhere to it because their instruction-following is weak over long-horizon tasks; their adherence drifts more than four times as steeply as that of strong models
supports
Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
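
The drift figures behind this claim can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the adherence scores listed on this page (after harness loading, at final validation), and the names `adherence` and `drift` are assumptions introduced here, not identifiers from the source paper.

```python
# Adherence scores as listed on this page:
# (after harness loading, at final validation).
adherence = {
    "Qwen3-32B": (0.52, 0.13),     # weak tier
    "GPT-OSS-120B": (0.67, 0.43),  # mid tier
    "Opus 4.6": (0.89, 0.80),      # strong tier
}


def drift(after_loading: float, final_validation: float) -> float:
    """Return the change in adherence, rounded to two decimals."""
    return round(final_validation - after_loading, 2)


# Drift per model; negative values mean adherence fell.
drifts = {model: drift(*scores) for model, scores in adherence.items()}

# How many times steeper the weak-tier drift is than the strong-tier drift.
weak_to_strong_ratio = drifts["Qwen3-32B"] / drifts["Opus 4.6"]

for model, value in drifts.items():
    print(f"{model}: {value:+.2f}")
print(f"weak/strong drift ratio: {weak_to_strong_ratio:.1f}")
```

With these inputs the weak-tier drift (-0.39) is roughly 4.3 times the strong-tier drift (-0.09), which matches the "more than four times" wording of the claim.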

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.855
Mid-tier model shows moderate adherence drift compared to weak and strong tiers
Opus 4.6 adherence remains stable from 0.89 after harness loading to 0.80 at final validation (drift of -0.09) · finding · 0.832
Strong-tier model maintains harness adherence over long-horizon trajectories
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.779
Shows that the SkillsBench low-base regime is variable; similar starting points can yield very different harness-benefit
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.774
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.766
Illustrates benchmark-dependent reshuffling of evolver rankings: no evolver dominates across all substrates
Weak-tier models often fail to invoke relevant harness artifacts during task-solving, with Qwen3-32B showing a 25% load rate against ~96% for strong models · claim · 0.760
Diagnosis of first failure mode explaining low harness-benefit for weak-tier models
Qwen-2.5-3B ASR drops from 98.6% at dim 1 to 45.1% at dim 2, recovering partially then declining to 65.3% at dim 5 · finding · 0.759
Smaller models show non-monotonic and diminished ASR with increasing cone dimensionality
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.750
Case demonstrating that model scale does not predict harness-updating quality