2026
paper:doi-10-48550-arxiv-2605-30621

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

TL;DR

Harness-updating capability is essentially flat across model capability tiers, while harness-benefit is non-monotonic — a decoupling with direct implications for how capability budgets should be allocated in self-evolving LLM agent systems. Across seven LLMs (including Claude Opus 4.6, Qwen3.5-9B, and GPT-OSS-120B) and three benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench), the gap between the best and worst evolver in harness-updating gain is at most 3.1 percentage points on any single benchmark, and Qwen3.5-9B produces skills procedurally isomorphic to those of Claude Opus 4.6 on the flink-query task, yielding identical downstream pass rates of 1.0. In contrast, harness-benefit peaks at mid-tier models (e.g., GPT-OSS-120B gains 7.0 pp on MCP-Atlas, Qwen3-235B gains 19.3 pp on SWE), with weak-tier models like Qwen3-32B gaining as little as 4.4 pp on SWE despite having the largest performance headroom. The paper introduces a two-capability decomposition framework — separating harness-updating from harness-benefit — and identifies two failure modes that explain weak-tier underperformance: harness activation failure (Qwen3-32B skill-load rate of 25.1% versus ~96% for strong-tier models) and harness adherence failure (Qwen3-32B adherence score drifts from 0.52 at harness load to 0.13 at final validation, a decay four times steeper than Opus 4.6's). These findings imply that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should explicitly target harness invocation and long-horizon instruction following as first-class skills.
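
The "four times steeper" comparison can be checked with a few lines of arithmetic. The sketch below assumes that "steeper" refers to the absolute drop in adherence score between harness load and final validation; the summary does not say whether the paper uses an absolute or a relative measure, so this reading is an assumption. The only inputs are the four adherence scores quoted above.

```python
# Adherence scores quoted in the summary: (at harness load, at final validation).
adherence = {
    "Qwen3-32B": (0.52, 0.13),
    "Claude Opus 4.6": (0.89, 0.80),
}


def absolute_drop(start: float, end: float) -> float:
    """Absolute decay in adherence score over the trajectory (assumed reading)."""
    return start - end


drops = {model: absolute_drop(*scores) for model, scores in adherence.items()}
ratio = drops["Qwen3-32B"] / drops["Claude Opus 4.6"]

for model, drop in drops.items():
    print(f"{model}: drop = {drop:.2f}")
print(f"ratio of drops = {ratio:.1f}")
```

Under this reading the Qwen3-32B drop (0.39) is roughly 4.3 times the Opus 4.6 drop (0.09), which is consistent with the "four times" figure.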

What to take away

  1. The harness-updating gain (∆update) varies by at most 3.1 percentage points across all seven evolvers on any single benchmark, meaning evolver model scale is not a reliable predictor of the quality of harness updates produced.
  2. Qwen3.5-9B acting as evolver achieves a ∆update of 3.8 pp on SkillsBench, exceeding both Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) on the same benchmark.
  3. On the flink-query SkillsBench task, a Qwen3.5-9B evolver and a Claude Opus 4.6 evolver both produce procedurally isomorphic skills that raise the same Opus 4.6 task-solving agent from a score of 0.67 to 1.0.
  4. Post-evolution pass rate is dominated by agent identity rather than evolver identity: even when the weakest anchor agent is paired with its best evolver and the strongest anchor agent with its worst evolver, the strong agent leads by 18.6 to 35.2 pp across all three benchmarks.
  5. Harness-benefit (∆benefit) is non-monotonic in base capability: on SWE-bench Verified, Qwen3-235B (mid-tier, base 20.7%) gains 19.3 pp while Qwen3-32B (weaker, base 3.6%) gains only 4.4 pp and Claude Opus 4.6 (strongest, base 74.2%) gains only 2.6 pp.
  6. Weak-tier model Qwen3-32B has a skill-load rate of 25.1% on SkillsBench versus approximately 96% for strong-tier models (Opus 4.6: 0.957, Sonnet 4.6: 0.959, Qwen3-235B: 0.961), constituting a harness activation failure mode.
  7. Even when skills are successfully loaded, Qwen3-32B's harness adherence score drops from 0.52 immediately after harness loading to 0.13 at final validation, compared with Opus 4.6's drop from 0.89 to 0.80, indicating a long-horizon instruction-following bottleneck distinct from activation failure.
  8. The paper introduces a controlled two-capability decomposition methodology (varying task-solving agents and evolvers independently across an anchor set, then computing ∆update and ∆benefit separately), which any researcher could replicate by fixing prompt templates and initial harness state across all agent-evolver pairs within a benchmark.
  9. An open question raised is whether the flat harness-updating result generalizes beyond skill-based harness components to prompt- and memory-based harnesses, since evolvable components differ by benchmark (skills only for SWE and SkillsBench; skills, prompts, and memories for MCP-Atlas) and the paper does not decompose evolver gains by artifact type.
  10. Qwen3-235B exhibits a clean dissociation between activation and adherence: its skill-load rate (0.961) nearly matches Opus 4.6, but its harness-following rate (HFR, 0.350) and pass-when-loaded rate (LPR, 0.022) are far below Opus 4.6's (HFR 0.757, LPR 0.177), showing that activation and adherence are separable failure modes.
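
The decomposition described in the takeaways can be sketched as a small computation over a grid of pass rates. This is a minimal illustration, not the paper's code: it assumes ∆update for an evolver is the mean gain over anchor agents and ∆benefit for an agent is the maximum gain over anchor evolvers, as the summary's wording suggests. All agent and evolver names and every number in the grid are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

# Hypothetical base pass rates (percent) per task-solving agent. NOT paper data.
base = {"agent_strong": 70.0, "agent_mid": 20.0, "agent_weak": 4.0}

# Hypothetical post-evolution pass rates (percent) per (agent, evolver) pair.
post = {
    ("agent_strong", "evolver_large"): 72.0,
    ("agent_strong", "evolver_small"): 73.0,
    ("agent_mid", "evolver_large"): 38.0,
    ("agent_mid", "evolver_small"): 36.0,
    ("agent_weak", "evolver_large"): 8.0,
    ("agent_weak", "evolver_small"): 7.0,
}


def gain(agent: str, evolver: str) -> float:
    """Pass-rate gain in percentage points for one agent-evolver pair."""
    return post[(agent, evolver)] - base[agent]


def delta_update(evolver: str, anchor_agents: list[str]) -> float:
    """Assumed definition: mean gain the evolver yields across anchor agents."""
    return mean(gain(agent, evolver) for agent in anchor_agents)


def delta_benefit(agent: str, anchor_evolvers: list[str]) -> float:
    """Assumed definition: best gain the agent obtains across anchor evolvers."""
    return max(gain(agent, evolver) for evolver in anchor_evolvers)


agents = list(base)
evolvers = ["evolver_large", "evolver_small"]

updates = {e: delta_update(e, agents) for e in evolvers}
benefits = {a: delta_benefit(a, evolvers) for a in agents}
update_spread = max(updates.values()) - min(updates.values())

print("delta_update per evolver:", updates)
print("delta_benefit per agent:", benefits)
print(f"spread in delta_update: {update_spread:.2f} pp")
```

In this toy grid the spread in ∆update between evolvers is under one point while ∆benefit differs by fifteen points between agents, which mirrors the shape of the reported pattern without reproducing any of its values.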

Peer brief — for seminar discussion

The paper investigates a specific gap in the self-evolving LLM agent literature: prior evaluations report end-to-end gains from harness evolution but cannot attribute those gains to the evolver model (which produces harness updates) versus the task-solving agent (which benefits from them). To disentangle these contributions, the paper introduces a two-capability decomposition framework with two operationalized metrics: harness-updating gain (∆update, averaged across anchor agents) and harness-benefit gain (∆benefit, maximized across anchor evolvers). A full factorial experiment crosses six task-solving agents with seven evolvers on three benchmarks: SWE-bench Verified (500 software-engineering tasks), MCP-Atlas (500 multi-server tool-use tasks), and SkillsBench (86 skill-based execution tasks across 11 domains). Models span Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Qwen3-235B, Qwen3-32B, GPT-OSS-120B, and Qwen3.5-9B.

The load-bearing finding is a two-part decoupling. First, harness-updating is flat across capability tiers: the spread in ∆update across all seven evolvers never exceeds 3.1 percentage points on any benchmark, and Qwen3.5-9B produces skills on SkillsBench (3.8 pp gain) that outperform both Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp); a case study confirms the 9B model writes procedurally isomorphic skills to those of Opus 4.6 on the flink-query task. Second, harness-benefit is non-monotonic: mid-tier models gain most (Qwen3-235B +19.3 pp on SWE, GPT-OSS-120B +7.0 pp on MCP), while both weak-tier and strong-tier models gain less, with the strong-tier shortfall attributed to performance ceilings. The weak-tier shortfall is traced to two failure modes: harness activation failure (Qwen3-32B skill-load rate 25.1% vs. ~96% for strong models) and harness adherence failure (Qwen3-32B adherence drops from 0.52 to 0.13 across the trajectory, a decay four times steeper than Opus 4.6's 0.89-to-0.80 drift).

The implication is that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should treat harness invocation and long-horizon instruction following as explicit training targets. An alternative design the study could have employed is parametric fine-tuning of the evolver on curated trajectory data, which would test whether the flat harness-updating result holds when the evolver is trained rather than prompted, a condition explicitly excluded from scope.

The central thing a critical reader would push back on is the operationalization of ∆benefit as the maximum gain across only three anchor evolvers (Opus 4.6, Sonnet 4.6, Qwen3-235B). Because all three are relatively capable models, the harness quality presented to agents may already be near the ceiling of what a prompted evolver can produce. That could compress the signal and make the non-monotonic pattern harder to interpret for weaker agents, which might respond differently to lower-quality harness updates. The paper hypothesizes that the flat harness-updating result reflects a procedural-content ceiling where any sufficiently capable evolver converges on the same skill recipes, but does not empirically test this against a wider range of evolver capabilities below the 9B threshold.
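
The separation between activation failure and adherence failure can be illustrated with per-trajectory bookkeeping. The sketch below is a guess at how such rates might be computed: the summary does not define the skill-load, harness-following, or pass-when-loaded rates, so the `Trajectory` fields, the function name, and the choice to condition the last two rates on loaded trajectories are all assumptions, and the toy records are invented.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical per-task record; field names are illustrative only."""
    loaded_skill: bool       # did the agent invoke/load the evolved skill?
    followed_harness: bool   # did it keep following the loaded instructions?
    passed: bool             # did the task pass final validation?


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio that returns 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def activation_and_adherence(trajectories: list[Trajectory]) -> dict[str, float]:
    """Assumed definitions: activation over all runs, the rest over loaded runs."""
    loaded = [t for t in trajectories if t.loaded_skill]
    return {
        "skill_load_rate": rate(len(loaded), len(trajectories)),
        "following_rate_when_loaded": rate(
            sum(t.followed_harness for t in loaded), len(loaded)
        ),
        "pass_rate_when_loaded": rate(sum(t.passed for t in loaded), len(loaded)),
    }


# Invented toy log: high activation, weak adherence.
toy = [
    Trajectory(True, True, True),
    Trajectory(True, False, False),
    Trajectory(True, False, False),
    Trajectory(True, True, False),
    Trajectory(False, False, False),
]
metrics = activation_and_adherence(toy)
print(metrics)
```

Conditioning the adherence-side rates on loaded trajectories is what lets a model score high on activation and low on adherence at the same time, which is the kind of dissociation the brief reports for Qwen3-235B.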

Frameworks (1)

  • Harness Evolution Capability Framework
    The paper's conceptual framework decomposing harness self-evolution into harness-updating and harness-benefit capabilities, distinct from base capability
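
One way to see why the framework treats harness-benefit as distinct from base capability is to tabulate the SWE-bench Verified figures quoted in this summary and test whether gain tracks base pass rate. The sketch below uses only those three quoted data points; the `is_monotonic` helper and the "headroom" calculation (100 minus base pass rate) are illustrative constructions, not the paper's analysis.

```python
# (model, base pass rate %, harness-benefit gain in pp) on SWE-bench Verified,
# as quoted in this summary.
swe = [
    ("Qwen3-32B", 3.6, 4.4),
    ("Qwen3-235B", 20.7, 19.3),
    ("Claude Opus 4.6", 74.2, 2.6),
]


def is_monotonic(values: list[float]) -> bool:
    """True if the sequence never changes direction (all up or all down)."""
    pairs = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


# Order models by base capability, then inspect how gain moves along that order.
ordered = sorted(swe, key=lambda row: row[1])
gains = [gain for _, _, gain in ordered]
headroom = {name: 100.0 - base for name, base, _ in ordered}
peak_model = max(ordered, key=lambda row: row[2])[0]

print("gains ordered by base capability:", gains)
print("monotonic in base capability:", is_monotonic(gains))
print("largest gain:", peak_model)
print("largest headroom:", max(headroom, key=headroom.get))
```

The model with the most headroom (Qwen3-32B) is not the one with the largest gain (Qwen3-235B), so neither base capability nor headroom alone predicts harness-benefit in these three points.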

Original abstract

This paper analyzes two distinct capabilities in harness self-evolution for LLM agents: harness-updating (producing useful harness updates from execution evidence) and harness-benefit (benefiting from updated harnesses during task solving). The analysis reveals that harness-updating is flat across capability tiers—models from different capability levels produce similarly useful updates—while harness-benefit is non-monotonic, with mid-tier models benefiting most and weak-tier models benefiting little due to failures in harness activation and adherence. The findings suggest investing in task-solving agent capabilities rather than evolver capabilities, and targeting harness invocation and instruction following in agent training.
