Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

By Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi · Yisi Sang · Bing He, +11 more
Amazon, Emory University, +4 more

DOI 10.48550/arxiv.2605.30621 arXiv 2605.30621

Harness Activation Failure · Harness Evolution Capability Framework · Harness-Benefit Gain (Δbenefit) · Harness Adherence Failure · Harness-Following Rate Measurement · Harness-Benefit Capability · Harness-Updating Gain (Δupdate) · Harness-Following Rate · Phase-Level Adherence Judge · Harness Self-Evolution · Harness Self-Evolution Safety · Harness-Updating Capability · Model Context Protocol · Pass-When-Loaded Rate · +1 more

TL;DR

Harness-updating capability is essentially flat across model capability tiers, while harness-benefit is non-monotonic — a decoupling with direct implications for how capability budgets should be allocated in self-evolving LLM agent systems. Across seven LLMs (including Claude Opus 4.6, Qwen3.5-9B, and GPT-OSS-120B) and three benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench), the gap between the best and worst evolver in harness-updating gain is at most 3.1 percentage points on any single benchmark, and Qwen3.5-9B produces skills procedurally isomorphic to those of Claude Opus 4.6 on the flink-query task, yielding identical downstream pass rates of 1.0. In contrast, harness-benefit peaks at mid-tier models (e.g., GPT-OSS-120B gains 7.0 pp on MCP-Atlas, Qwen3-235B gains 19.3 pp on SWE), with weak-tier models like Qwen3-32B gaining as little as 4.4 pp on SWE despite having the largest performance headroom. The paper introduces a two-capability decomposition framework — separating harness-updating from harness-benefit — and identifies two failure modes that explain weak-tier underperformance: harness activation failure (Qwen3-32B skill-load rate of 25.1% versus ~96% for strong-tier models) and harness adherence failure (Qwen3-32B adherence score drifts from 0.52 at harness load to 0.13 at final validation, a decay four times steeper than Opus 4.6's). These findings imply that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should explicitly target harness invocation and long-horizon instruction following as first-class skills.

What to take away

1. The harness-updating gain (∆update) varies by at most 3.1 percentage points across all seven evolvers on any single benchmark, meaning evolver model scale is not a reliable predictor of the quality of harness updates produced.
2. Qwen3.5-9B acting as evolver achieves a ∆update of 3.8 pp on SkillsBench, exceeding both Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) on the same benchmark.
3. On the flink-query SkillsBench task, a Qwen3.5-9B evolver and a Claude Opus 4.6 evolver produce procedurally isomorphic skills, each of which raises the same Opus 4.6 task-solving agent from a score of 0.67 to 1.0.
4. Post-evolution pass rate is dominated by agent identity rather than evolver identity: even when the weakest anchor agent is paired with its best evolver and the strongest anchor agent with its worst evolver, the strong agent still leads by 18.6 to 35.2 pp on all three benchmarks.
5. Harness-benefit (∆benefit) is non-monotonic in base capability: on SWE-bench Verified, Qwen3-235B (mid-tier, base 20.7%) gains 19.3 pp while Qwen3-32B (weaker, base 3.6%) gains only 4.4 pp and Claude Opus 4.6 (strongest, base 74.2%) gains only 2.6 pp.
6. Weak-tier model Qwen3-32B has a skill-load rate of 25.1% on SkillsBench versus approximately 96% for strong-tier models (Opus 4.6: 0.957, Sonnet 4.6: 0.959, Qwen3-235B: 0.961), constituting a harness activation failure mode.
7. Even when skills are successfully loaded, Qwen3-32B's harness adherence score drops from 0.52 immediately after harness loading to 0.13 at final validation, compared with Opus 4.6's drop from 0.89 to 0.80, indicating a long-horizon instruction-following bottleneck distinct from activation failure.
8. The paper introduces a controlled two-capability decomposition methodology — varying task-solving agents and evolvers independently across an anchor set, then computing ∆update and ∆benefit separately — which any researcher could replicate by fixing prompt templates and initial harness state across all agent-evolver pairs within a benchmark.
9. An open question raised is whether the flat harness-updating result generalizes beyond skill-based harness components to prompt- and memory-based harnesses, since evolvable components differ by benchmark (skills only for SWE and SkillsBench; skills, prompts, and memories for MCP-Atlas) and the paper does not decompose evolver gains by artifact type.
10. Qwen3-235B exhibits a clean dissociation between activation and adherence: its skill-load rate (0.961) nearly matches Opus 4.6, but its harness-following rate (0.350) and pass-when-loaded rate (0.022) are far below Opus 4.6's (0.757 HFR, 0.177 LPR), showing that activation and adherence are separable failure modes.
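
The two-capability decomposition in takeaway 8 can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the agent and evolver names and all pass rates below are hypothetical placeholders, and the only parts taken from the summary are the definitions (∆update is the mean pairwise gain across anchor agents, ∆benefit is the maximum pairwise gain across anchor evolvers).

```python
from statistics import mean

# Hypothetical pass rates in percent. These numbers are NOT from the paper.
# base[agent]: pass rate of the agent under the initial (un-evolved) harness.
base = {"agent_strong": 70.0, "agent_mid": 20.0, "agent_weak": 5.0}

# post[(agent, evolver)]: pass rate after that evolver has updated the harness.
post = {
    ("agent_strong", "evolver_a"): 73.0,
    ("agent_strong", "evolver_b"): 72.0,
    ("agent_mid", "evolver_a"): 35.0,
    ("agent_mid", "evolver_b"): 38.0,
    ("agent_weak", "evolver_a"): 8.0,
    ("agent_weak", "evolver_b"): 9.0,
}

anchor_agents = list(base)
anchor_evolvers = ["evolver_a", "evolver_b"]


def pairwise_gain(agent, evolver):
    """Gain in percentage points for one agent-evolver pair."""
    return post[(agent, evolver)] - base[agent]


def delta_update(evolver, agents):
    """Harness-updating gain: mean pairwise gain across the anchor agents."""
    return mean(pairwise_gain(a, evolver) for a in agents)


def delta_benefit(agent, evolvers):
    """Harness-benefit gain: maximum pairwise gain across the anchor evolvers."""
    return max(pairwise_gain(agent, e) for e in evolvers)


for e in anchor_evolvers:
    print(f"delta_update({e}) = {delta_update(e, anchor_agents):.1f} pp")
for a in anchor_agents:
    print(f"delta_benefit({a}) = {delta_benefit(a, anchor_evolvers):.1f} pp")
```

Averaging over agents scores the evolver independently of any single agent's quirks, while taking the maximum over evolvers scores the agent against the best harness available to it.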

Peer brief — for seminar discussion

The paper addresses a specific gap in the self-evolving LLM agent literature: prior evaluations report end-to-end gains from harness evolution but cannot attribute those gains to the evolver model (which produces harness updates) versus the task-solving agent (which benefits from them). To disentangle these contributions, the paper introduces a two-capability decomposition framework with two operationalized metrics, harness-updating gain (∆update, averaged across anchor agents) and harness-benefit gain (∆benefit, maximized across anchor evolvers), and runs a full factorial experiment that crosses six task-solving agents with seven evolvers on three benchmarks: SWE-bench Verified (500 software-engineering tasks), MCP-Atlas (500 multi-server tool-use tasks), and SkillsBench (86 skill-based execution tasks across 11 domains). Models span Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Qwen3-235B, Qwen3-32B, GPT-OSS-120B, and Qwen3.5-9B.
The load-bearing finding is a two-part decoupling. First, harness-updating is flat across capability tiers: the spread in ∆update across all seven evolvers never exceeds 3.1 percentage points on any benchmark, and Qwen3.5-9B produces skills on SkillsBench (3.8 pp gain) that outperform those of both Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp); a case study confirms that the 9B model writes skills procedurally isomorphic to those of Opus 4.6 on the flink-query task. Second, harness-benefit is non-monotonic: mid-tier models gain most (Qwen3-235B +19.3 pp on SWE, GPT-OSS-120B +7.0 pp on MCP), while both weak-tier and strong-tier models gain less, with the strong-tier shortfall attributed to performance ceilings.
The weak-tier shortfall is traced to two failure modes: harness activation failure (Qwen3-32B skill-load rate 25.1% vs. ~96% for strong models) and harness adherence failure (Qwen3-32B adherence drops from 0.52 to 0.13 across the trajectory, a decay four times steeper than Opus 4.6's 0.89-to-0.80 drift).
The implication is that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should treat harness invocation and long-horizon instruction following as explicit training targets. An alternative design the study could have employed is parametric fine-tuning of the evolver on curated trajectory data, which would test whether the flat harness-updating result holds when the evolver is trained rather than prompted — a condition explicitly excluded from scope. The central thing a critical reader would push back on is the operationalization of ∆benefit as the maximum gain across only three anchor evolvers (Opus 4.6, Sonnet 4.6, Qwen3-235B): because all three are relatively capable models, the harness quality presented to agents may already be near ceiling for what a prompted evolver can produce, potentially compressing the signal and making the non-monotonic pattern harder to interpret for weaker agents that might respond differently to lower-quality harness updates. The paper hypothesizes that the flat harness-updating result reflects a procedural-content ceiling where any sufficiently capable evolver converges on the same skill recipes, but does not empirically test this against a wider range of evolver capabilities below the 9B threshold.
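
The "four times steeper" comparison can be checked from the endpoint adherence scores quoted above. The sketch below uses only the first-phase and last-phase scores given in this summary (intermediate phase scores are not reported here), and it measures steepness as the plain difference between the two endpoints, which is an assumption about how the comparison was made.

```python
# Adherence scores (harness load, final validation) as quoted in this summary.
endpoint_scores = {
    "Qwen3-32B": (0.52, 0.13),
    "Claude Opus 4.6": (0.89, 0.80),
}


def adherence_drop(scores):
    """Absolute drop in adherence between the first and last phase."""
    start, end = scores
    return start - end


drops = {model: adherence_drop(s) for model, s in endpoint_scores.items()}
ratio = drops["Qwen3-32B"] / drops["Claude Opus 4.6"]

for model, drop in drops.items():
    print(f"{model}: drop of {drop:.2f}")
print(f"ratio of drops: {ratio:.1f}x")
```

The drops come out at 0.39 and 0.09, a ratio of roughly 4.3, which is consistent with both the "four times" and the "more than four times" wording used on this page.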

Methods (4)

Harness-Benefit Gain (Δbenefit)
Metric measuring harness-benefit capability as the maximum pairwise gain across a fixed anchor evolver set
Harness-Following Rate Measurement
LLM-judge pipeline measuring fraction of skill-loaded trajectories where agent follows loaded skill guidance, using Claude Sonnet 4.6 as judge
Harness-Updating Gain (Δupdate)
Metric measuring harness-updating capability as the mean pairwise gain across an anchor agent set
Phase-Level Adherence Judge
Separate LLM judge that partitions trajectories into five phases and assigns 0–1 adherence scores per phase using Claude Sonnet 4.6
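
As a rough illustration of how the trajectory-level rates relate to each other, the sketch below computes a skill-load rate and a harness-following rate from judged trajectories. The `Trajectory` record and its field names are hypothetical, the judge verdict is assumed to be already available as a boolean, and treating skill-load rate as the fraction of all trajectories with a loaded skill is an assumption; only the definition of harness-following rate (the fraction of skill-loaded trajectories in which the agent follows the loaded guidance) comes from the summary.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record for one agent run on one task."""
    skill_loaded: bool    # did a relevant skill enter the agent's context?
    followed_skill: bool  # judge verdict; only meaningful when skill_loaded


def skill_load_rate(trajectories):
    """Fraction of all trajectories in which a skill was loaded (assumed definition)."""
    if not trajectories:
        return 0.0
    return sum(t.skill_loaded for t in trajectories) / len(trajectories)


def harness_following_rate(trajectories):
    """Fraction of skill-loaded trajectories in which the agent followed the skill."""
    loaded = [t for t in trajectories if t.skill_loaded]
    if not loaded:
        return 0.0
    return sum(t.followed_skill for t in loaded) / len(loaded)


# Toy data, not from the paper: three of four runs load a skill, one follows it.
runs = [
    Trajectory(skill_loaded=True, followed_skill=True),
    Trajectory(skill_loaded=True, followed_skill=False),
    Trajectory(skill_loaded=False, followed_skill=False),
    Trajectory(skill_loaded=True, followed_skill=False),
]
print(skill_load_rate(runs), harness_following_rate(runs))
```

Because the following rate is conditioned on loading, a model can score high on the first and low on the second, which is the dissociation reported for Qwen3-235B.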

Frameworks (1)

Harness Evolution Capability Framework
The paper's conceptual framework decomposing harness self-evolution into harness-updating and harness-benefit capabilities, distinct from base capability

Findings (21)

Qwen3-32B on pg-essay-to-audiobook loads the TTS-fallback skill but treats it as literal script, skips fallback chain after first failure, and emits task_complete:true without valid output
Case study illustrating procedural-execution-layer failure in harness adherence
Qwen3-32B on threejs task issues a multi-key JSON action bundling load_skill with analysis and plan, causing the format gate to reject it and the skill to never enter context
Case study illustrating action-protocol-layer failure in harness activation
Even when the weakest anchor agent is paired with its best evolver and the strongest anchor agent with its worst evolver, the strong agent still leads by 18.6 to 35.2 pp on every benchmark
Confirms that post-evolution performance bottleneck is on the agent side, not evolver side
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities
Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp)
Case demonstrating that model scale does not predict harness-updating quality
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp
Full evolver-side SWE results showing comparable performance across Claude family tiers
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline
Shows that the SkillsBench low-base regime is variable; similar starting points can yield very different harness-benefit
On MCP-Atlas, harness-benefit peaks at GPT-OSS-120B (7.0 pp), with lower gains at both ends of the base-capability scale
Replication of non-monotonic harness-benefit pattern on a second benchmark
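
The non-monotonic pattern on SWE-bench Verified can be made concrete with the three base rates and gains quoted on this page. The sketch below sorts the models by base pass rate and checks that the largest gain sits at an interior point. The `headroom_share` quantity (gain divided by the distance to 100%) is an illustrative normalization introduced here, not a metric from the paper.

```python
# (model, base pass rate %, harness-benefit gain in pp) on SWE-bench Verified,
# as quoted in this summary.
swe = [
    ("Qwen3-32B", 3.6, 4.4),
    ("Qwen3-235B", 20.7, 19.3),
    ("Claude Opus 4.6", 74.2, 2.6),
]

# Order models from weakest to strongest base capability.
by_base = sorted(swe, key=lambda row: row[1])
gains = [gain for _, _, gain in by_base]

# Non-monotonic here means the peak gain is neither the weakest nor the strongest model.
peak_index = gains.index(max(gains))
is_interior_peak = 0 < peak_index < len(gains) - 1
peak_model = by_base[peak_index][0]

# Illustrative normalization (an assumption): share of remaining headroom recovered.
headroom_share = {name: gain / (100.0 - base) for name, base, gain in swe}

print(peak_model, is_interior_peak)
for name, share in headroom_share.items():
    print(f"{name}: {share:.1%} of headroom")
```

Under this normalization the weakest model recovers the smallest share of its headroom, so the ceiling argument offered for strong models cannot account for the weak-tier shortfall.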

Claims (11)

End-to-end evaluation scores conflate three sources of improvement: base capability, harness-updating quality, and harness-benefit, leaving it unclear which models produce useful updates or benefit most from them
Motivating claim for the paper's controlled analysis approach
Even when the harness is loaded, weak-tier models fail to adhere to it because of weak instruction following over long-horizon tasks; their adherence decays more than four times as steeply as that of strong models
Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
Long-horizon instruction following is a second key training target for agent development, as even loaded harnesses are not followed faithfully over extended trajectories by weak models
Design recommendation derived from harness adherence failure and phase-level drift findings
Harness-benefit is non-monotonic in base capability: weak-tier models benefit little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier
Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
Weak-tier model deficits are not in task understanding but in protocol-level and procedural execution: they identify the right skill but cannot operate under it
Diagnostic claim from case studies of activation and adherence failures in Qwen3-32B
Harness-updating capability is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains
First major claim of the paper, supported by narrow spread across evolvers and case study
Weak-tier models often fail to invoke relevant harness artifacts during task-solving, with Qwen3-32B showing a 25% load rate against ~96% for strong models
Diagnosis of first failure mode explaining low harness-benefit for weak-tier models
Harness invocation should be treated as a first-class learned skill and baked into agent training, as weak-tier models fail to load skills 75% of the time
Design recommendation derived from harness activation failure finding
Capability budget should be allocated to the task-solving agent rather than the evolver, since harness-updating varies by at most 3.1 pp across evolvers
Primary design recommendation derived from harness-updating flatness finding
Loading the harness is not sufficient for benefiting from it: a model with a near-ceiling skill-load rate (SLR) can still have a low harness-following rate (HFR) and a low pass-when-loaded rate (LPR)
Derived from Qwen3-235B's dissociation between SLR (0.961) and HFR (0.350)

Hypotheses (1)

Strong-tier models benefit less from harness evolution because they already solve many tasks under the initial harness, leaving less room for improvement (ceiling effect)
Explanation offered for why high-base-capability models show lower Δbenefit

Questions (4)

What explains why weak-tier models with the most performance headroom benefit least from harness evolution?
In-depth diagnostic question addressed by the two failure mode analysis
Does a model's base capability in task-solving predict its capabilities in harness self-evolution?
Central framing question motivating the paper's capability decomposition
Which models actually benefit from updated harnesses?
Second open question the paper sets out to answer through agent-side analysis
Which models produce useful harness updates?
First open question the paper sets out to answer through evolver-side analysis

Original abstract

This paper analyzes two distinct capabilities in harness self-evolution for LLM agents: harness-updating (producing useful harness updates from execution evidence) and harness-benefit (benefiting from updated harnesses during task solving). The analysis reveals that harness-updating is flat across capability tiers—models from different capability levels produce similarly useful updates—while harness-benefit is non-monotonic, with mid-tier models benefiting most and weak-tier models benefiting little due to failures in harness activation and adherence. The findings suggest investing in task-solving agent capabilities rather than evolver capabilities, and targeting harness invocation and instruction following in agent training.
