paper:doi-10-48550-arxiv-2605-30621
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
TL;DR
Harness-updating capability is essentially flat across model capability tiers, while harness-benefit is non-monotonic — a decoupling with direct implications for how capability budgets should be allocated in self-evolving LLM agent systems. Across seven LLMs (including Claude Opus 4.6, Qwen3.5-9B, and GPT-OSS-120B) and three benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench), the gap between the best and worst evolver in harness-updating gain is at most 3.1 percentage points on any single benchmark, and Qwen3.5-9B produces skills procedurally isomorphic to those of Claude Opus 4.6 on the flink-query task, yielding identical downstream pass rates of 1.0. In contrast, harness-benefit peaks at mid-tier models (e.g., GPT-OSS-120B gains 7.0 pp on MCP-Atlas, Qwen3-235B gains 19.3 pp on SWE), with weak-tier models like Qwen3-32B gaining as little as 4.4 pp on SWE despite having the largest performance headroom. The paper introduces a two-capability decomposition framework — separating harness-updating from harness-benefit — and identifies two failure modes that explain weak-tier underperformance: harness activation failure (Qwen3-32B skill-load rate of 25.1% versus ~96% for strong-tier models) and harness adherence failure (Qwen3-32B adherence score drifts from 0.52 at harness load to 0.13 at final validation, a decay four times steeper than Opus 4.6's). These findings imply that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should explicitly target harness invocation and long-horizon instruction following as first-class skills.
What to take away
- 1. The harness-updating gain (∆update) varies by at most 3.1 percentage points across all seven evolvers on any single benchmark, meaning evolver model scale is not a reliable predictor of the quality of harness updates produced.
- 2. Qwen3.5-9B acting as evolver achieves a ∆update of 3.8 pp on SkillsBench, exceeding both Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) on the same benchmark.
- 3. On the flink-query SkillsBench task, a Qwen3.5-9B evolver and a Claude Opus 4.6 evolver both produce procedurally isomorphic skills that raise the same Opus 4.6 task-solving agent from a score of 0.67 to 1.0.
- 4. Post-evolution pass rate is dominated by agent identity rather than evolver identity: even pairing the weakest anchor agent with its best evolver against the strongest anchor agent with its worst evolver, the strong agent leads by 18.6 to 35.2 pp across all three benchmarks.
- 5. Harness-benefit (∆benefit) is non-monotonic in base capability: on SWE-bench Verified, Qwen3-235B (mid-tier, base 20.7%) gains 19.3 pp while Qwen3-32B (weaker, base 3.6%) gains only 4.4 pp and Claude Opus 4.6 (strongest, base 74.2%) gains only 2.6 pp.
- 6. Weak-tier model Qwen3-32B has a skill-load rate of 25.1% on SkillsBench versus approximately 96% for strong-tier models (Opus 4.6: 0.957, Sonnet 4.6: 0.959, Qwen3-235B: 0.961), constituting a harness activation failure mode.
- 7. Even when skills are successfully loaded, Qwen3-32B's harness adherence score drops from 0.52 immediately after harness loading to 0.13 at final validation, compared with Opus 4.6's drop from 0.89 to 0.80, indicating a long-horizon instruction-following bottleneck distinct from activation failure.
- 8. The paper introduces a controlled two-capability decomposition methodology — varying task-solving agents and evolvers independently across an anchor set, then computing ∆update and ∆benefit separately — which any researcher could replicate by fixing prompt templates and initial harness state across all agent-evolver pairs within a benchmark.
- 9. An open question raised is whether the flat harness-updating result generalizes beyond skill-based harness components to prompt- and memory-based harnesses, since evolvable components differ by benchmark (skills only for SWE and SkillsBench; skills, prompts, and memories for MCP-Atlas) and the paper does not decompose evolver gains by artifact type.
- 10. Qwen3-235B exhibits a clean dissociation between activation and adherence: its skill-load rate (0.961) nearly matches Opus 4.6, but its harness-following rate (0.350) and pass-when-loaded rate (0.022) are far below Opus 4.6's (0.757 HFR, 0.177 LPR), showing that activation and adherence are separable failure modes.
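The three rates in takeaways 6 and 10 can be sketched as simple ratios over per-trajectory records. This is a minimal illustration, not the paper's pipeline: the `Trajectory` record and `rates` function are hypothetical names, the `followed` flag stands in for the LLM-judge verdict, and the definition of the pass-when-loaded rate (passes among skill-loaded trajectories) is an assumption read off its name.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    loaded: bool    # did the agent load a skill at all?
    followed: bool  # did the judge mark the loaded skill as followed?
    passed: bool    # did the run pass final validation?


def rates(trajs):
    """Return (SLR, HFR, LPR) for a list of trajectories.

    SLR: share of all trajectories that loaded a skill.
    HFR: share of skill-loaded trajectories that followed the skill.
    LPR: share of skill-loaded trajectories that passed (assumed definition).
    """
    loaded = [t for t in trajs if t.loaded]
    slr = len(loaded) / len(trajs) if trajs else 0.0
    hfr = sum(t.followed for t in loaded) / len(loaded) if loaded else 0.0
    lpr = sum(t.passed for t in loaded) / len(loaded) if loaded else 0.0
    return slr, hfr, lpr


# Made-up records for illustration only (not the paper's data).
sample = [
    Trajectory(loaded=True, followed=True, passed=False),
    Trajectory(loaded=True, followed=False, passed=False),
    Trajectory(loaded=False, followed=False, passed=True),
    Trajectory(loaded=False, followed=False, passed=False),
]
slr, hfr, lpr = rates(sample)
print(slr, hfr, lpr)
```

Because HFR and LPR are conditioned on loading, a model can score near the ceiling on SLR and still score low on the other two, which is the dissociation reported for Qwen3-235B.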
Peer brief — for seminar discussion
The paper investigates a specific gap in the self-evolving LLM agent literature: prior evaluations report end-to-end gains from harness evolution but cannot attribute those gains to the evolver model (which produces harness updates) versus the task-solving agent (which benefits from them). To disentangle these contributions, the paper introduces a two-capability decomposition framework with two operationalized metrics: harness-updating gain (∆update, averaged across anchor agents) and harness-benefit gain (∆benefit, maximized across anchor evolvers). A full factorial experiment crosses six task-solving agents with seven evolvers on three benchmarks: SWE-bench Verified (500 software-engineering tasks), MCP-Atlas (500 multi-server tool-use tasks), and SkillsBench (86 skill-based execution tasks across 11 domains). Models span Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Qwen3-235B, Qwen3-32B, GPT-OSS-120B, and Qwen3.5-9B.
The load-bearing finding is a two-part decoupling. First, harness-updating is flat across capability tiers: the spread in ∆update across all seven evolvers never exceeds 3.1 percentage points on any benchmark, and Qwen3.5-9B produces skills on SkillsBench (3.8 pp gain) that outperform those of both Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp); a case study confirms that the 9B model writes skills procedurally isomorphic to those of Opus 4.6 on the flink-query task. Second, harness-benefit is non-monotonic: mid-tier models gain most (Qwen3-235B +19.3 pp on SWE, GPT-OSS-120B +7.0 pp on MCP), while both weak-tier and strong-tier models gain less. The paper hypothesizes that strong models gain less because of a ceiling effect: they already solve many tasks under the initial harness. The weak-tier shortfall is traced to two failure modes: harness activation failure (Qwen3-32B skill-load rate 25.1% vs. ~96% for strong models) and harness adherence failure (Qwen3-32B adherence drops from 0.52 to 0.13 across the trajectory, a decay four times steeper than Opus 4.6's 0.89-to-0.80 drift).
The implication is that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should treat harness invocation and long-horizon instruction following as explicit training targets. An alternative design the study could have employed is parametric fine-tuning of the evolver on curated trajectory data, which would test whether the flat harness-updating result holds when the evolver is trained rather than prompted — a condition explicitly excluded from scope. The central thing a critical reader would push back on is the operationalization of ∆benefit as the maximum gain across only three anchor evolvers (Opus 4.6, Sonnet 4.6, Qwen3-235B): because all three are relatively capable models, the harness quality presented to agents may already be near ceiling for what a prompted evolver can produce, potentially compressing the signal and making the non-monotonic pattern harder to interpret for weaker agents that might respond differently to lower-quality harness updates. The paper hypothesizes that the flat harness-updating result reflects a procedural-content ceiling where any sufficiently capable evolver converges on the same skill recipes, but does not empirically test this against a wider range of evolver capabilities below the 9B threshold.
Methods (4)
- Harness-Benefit Gain (Δbenefit): Metric measuring harness-benefit capability as the maximum pairwise gain across a fixed anchor evolver set
- Harness-Following Rate Measurement: LLM-judge pipeline measuring fraction of skill-loaded trajectories where agent follows loaded skill guidance, using Claude Sonnet 4.6 as judge
- Harness-Updating Gain (Δupdate): Metric measuring harness-updating capability as the mean pairwise gain across an anchor agent set
- Phase-Level Adherence Judge: Separate LLM judge that partitions trajectories into five phases and assigns 0–1 adherence scores per phase using Claude Sonnet 4.6
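The two gain metrics listed above can be sketched as follows, assuming a "pairwise gain" is the post-evolution pass rate of an agent-evolver pair minus that agent's base pass rate. The function names, dictionary layout, and all numbers are hypothetical and chosen for illustration; they are not the paper's code or data.

```python
from statistics import mean


def delta_update(base, post, evolver, anchor_agents):
    """Mean gain (pp) that one evolver yields across the anchor agents."""
    return mean(post[(agent, evolver)] - base[agent] for agent in anchor_agents)


def delta_benefit(base, post, agent, anchor_evolvers):
    """Largest gain (pp) that one agent obtains from any anchor evolver."""
    return max(post[(agent, evolver)] - base[agent] for evolver in anchor_evolvers)


# Made-up pass rates in percent (not the paper's data).
base = {"agent_a": 70.0, "agent_b": 20.0}
post = {
    ("agent_a", "evolver_x"): 72.0,
    ("agent_a", "evolver_y"): 73.0,
    ("agent_b", "evolver_x"): 30.0,
    ("agent_b", "evolver_y"): 28.0,
}
agents = ["agent_a", "agent_b"]
evolvers = ["evolver_x", "evolver_y"]

update_x = delta_update(base, post, "evolver_x", agents)    # mean of 2 and 10
benefit_b = delta_benefit(base, post, "agent_b", evolvers)  # max of 10 and 8
print(update_x, benefit_b)
```

Averaging over agents scores an evolver independently of which agent consumes its updates, while taking the maximum over evolvers scores an agent under the best harness any anchor evolver produced for it.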
Frameworks (1)
- Harness Evolution Capability Framework: The paper's conceptual framework decomposing harness self-evolution into harness-updating and harness-benefit capabilities, distinct from base capability
Findings (21)
- Qwen3-32B on pg-essay-to-audiobook loads the TTS-fallback skill but treats it as literal script, skips fallback chain after first failure, and emits task_complete:true without valid output
Case study illustrating procedural-execution-layer failure in harness adherence
- Qwen3-32B on threejs task issues a multi-key JSON action bundling load_skill with analysis and plan, causing the format gate to reject it and the skill to never enter context
Case study illustrating action-protocol-layer failure in harness activation
- Pairing weakest anchor agent with best evolver against strongest anchor with worst evolver, the strong agent still leads by 18.6 to 35.2 pp on every benchmark
Confirms that post-evolution performance bottleneck is on the agent side, not evolver side
- Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
- Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities
Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
- On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp)
Case demonstrating that model scale does not predict harness-updating quality
- On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp
Full evolver-side SWE results showing comparable performance across Claude family tiers
- Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline
Shows that the low-base regime on SkillsBench is variable; similar starting points can yield very different harness-benefit
- On MCP-Atlas, harness-benefit peaks at GPT-OSS-120B (7.0 pp), with lower gains at both ends of the base-capability scale
Replication of non-monotonic harness-benefit pattern on a second benchmark
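The agent-dominance findings above rest on two comparisons over the agent-by-evolver score grid: the spread of one agent's scores across evolvers, and the lead of the strongest agent under its worst evolver over the weakest agent under its best evolver. A minimal sketch with hypothetical function names and made-up scores (not the paper's data):

```python
def within_agent_spread(post, agent, evolvers):
    """Max minus min post-evolution score (pp) of one agent across evolvers."""
    scores = [post[(agent, e)] for e in evolvers]
    return max(scores) - min(scores)


def worst_case_lead(post, strong, weak, evolvers):
    """Strong agent with its worst evolver minus weak agent with its best."""
    strong_worst = min(post[(strong, e)] for e in evolvers)
    weak_best = max(post[(weak, e)] for e in evolvers)
    return strong_worst - weak_best


# Made-up post-evolution pass rates in percent, for illustration only.
evolvers = ["e1", "e2", "e3"]
post = {
    ("strong", "e1"): 72.0, ("strong", "e2"): 73.0, ("strong", "e3"): 71.0,
    ("weak", "e1"): 30.0, ("weak", "e2"): 28.0, ("weak", "e3"): 33.0,
}

spread_weak = within_agent_spread(post, "weak", evolvers)  # 33 - 28
lead = worst_case_lead(post, "strong", "weak", evolvers)   # 71 - 33
print(spread_weak, lead)
```

A positive worst-case lead that is large relative to the within-agent spread is the pattern the findings describe: swapping evolvers moves an agent's score far less than swapping agents does.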
Claims (11)
- End-to-end evaluation scores conflate three sources of improvement: base capability, harness-updating quality, and harness-benefit, leaving it unclear which models produce useful updates or benefit most from them
Motivating claim for the paper's controlled analysis approach
- Even when the harness is loaded, weak-tier models fail to adhere to it due to weak instruction-following over long-horizon tasks, drifting more than four times more steeply than strong models
Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
- Long-horizon instruction following is a second key training target for agent development, as even loaded harnesses are not followed faithfully over extended trajectories by weak models
Design recommendation derived from harness adherence failure and phase-level drift findings
- Harness-benefit is non-monotonic in base capability: weak-tier models benefit little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier
Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
- Weak-tier model deficits are not in task understanding but in protocol-level and procedural execution: they identify the right skill but cannot operate under it
Diagnostic claim from case studies of activation and adherence failures in Qwen3-32B
- Harness-updating capability is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains
First major claim of the paper, supported by narrow spread across evolvers and case study
- Weak-tier models often fail to invoke relevant harness artifacts during task-solving, with Qwen3-32B showing a 25% load rate against ~96% for strong models
Diagnosis of first failure mode explaining low harness-benefit for weak-tier models
- Harness invocation should be treated as a first-class learned skill and baked into agent training, as weak-tier models fail to load skills 75% of the time
Design recommendation derived from harness activation failure finding
- Capability budget should be allocated to the task-solving agent rather than the evolver, since harness-updating varies by at most 3.1 pp across evolvers
Primary design recommendation derived from harness-updating flatness finding
- Loading the harness is not sufficient for benefiting from it: a model with a near-ceiling skill-load rate (SLR) can still have a low harness-following rate (HFR) and a low pass-when-loaded rate (LPR)
Derived from Qwen3-235B's dissociation between SLR (0.961) and HFR (0.350)
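The "more than four times" drift comparison in the claims above can be checked with quick arithmetic on the reported adherence scores. The sketch assumes "steeper" refers to the ratio of absolute score drops between harness load and final validation; the summary does not state how the paper computes it, and the `drop` helper is a hypothetical name.

```python
def drop(score_at_load, score_at_final):
    """Absolute adherence drop between harness load and final validation."""
    return score_at_load - score_at_final


qwen_drop = drop(0.52, 0.13)  # Qwen3-32B, scores as reported
opus_drop = drop(0.89, 0.80)  # Opus 4.6, scores as reported
ratio = qwen_drop / opus_drop

print(f"{qwen_drop:.2f} {opus_drop:.2f} {ratio:.1f}")  # 0.39 0.09 4.3
```

Under that assumption the ratio is about 4.3, consistent with both the "four times" and "more than four times" wordings used in this summary.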
Hypotheses (1)
- Strong-tier models benefit less from harness evolution because they already solve many tasks under the initial harness, leaving less room for improvement (ceiling effect)
Explanation offered for why high-base-capability models show lower Δbenefit
Questions (4)
- What explains why weak-tier models with the most performance headroom benefit least from harness evolution?
In-depth diagnostic question addressed by the two failure mode analysis
- Does a model's base capability in task-solving predict its capabilities in harness self-evolution?
Central framing question motivating the paper's capability decomposition
- Which models actually benefit from updated harnesses?
Second open question the paper sets out to answer through agent-side analysis
- Which models produce useful harness updates?
First open question the paper sets out to answer through evolver-side analysis
Original abstract
This paper analyzes two distinct capabilities in harness self-evolution for LLM agents: harness-updating (producing useful harness updates from execution evidence) and harness-benefit (benefiting from updated harnesses during task solving). The analysis reveals that harness-updating is flat across capability tiers—models from different capability levels produce similarly useful updates—while harness-benefit is non-monotonic, with mid-tier models benefiting most and weak-tier models benefiting little due to failures in harness activation and adherence. The findings suggest investing in task-solving agent capabilities rather than evolver capabilities, and targeting harness invocation and instruction following in agent training.