Yisi Sang

Authored

Introduces

Studies

Affiliations

Cited by

Authored papers (1)

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents2026
Harness-updating capability is essentially flat across model capability tiers, while harness-benefit is non-monotonic — a decoupling with direct implications for how capability budgets should be allocated in self-evolving LLM agent systems. Across seven LLMs (including Claude Opus 4.6, Qwen3.5-9B, and GPT-OSS-120B) and three benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench), the gap between the best and worst evolver in harness-updating gain is at most 3.1 percentage points on any single benchmark, and Qwen3.5-9B produces skills procedurally isomorphic to those of Claude Opus 4.6 on the flink-query task, yielding identical downstream pass rates of 1.0. In contrast, harness-benefit peaks at mid-tier models (e.g., GPT-OSS-120B gains 7.0 pp on MCP-Atlas, Qwen3-235B gains 19.3 pp on SWE), with weak-tier models like Qwen3-32B gaining as little as 4.4 pp on SWE despite having the largest performance headroom. The paper introduces a two-capability decomposition framework — separating harness-updating from harness-benefit — and identifies two failure modes that explain weak-tier underperformance: harness activation failure (Qwen3-32B skill-load rate of 25.1% versus ~96% for strong-tier models) and harness adherence failure (Qwen3-32B adherence score drifts from 0.52 at harness load to 0.13 at final validation, a decay four times steeper than Opus 4.6's). These findings imply that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should explicitly target harness invocation and long-horizon instruction following as first-class skills.

More papers — OpenAlex / S2

Co-authors (12)

Benoit Dumoulin2 shared
Bing He2 shared
Cihang Xie2 shared
Hanqing Lu2 shared
Juncheng Wu2 shared
Minhua Lin2 shared
Suhang Wang2 shared
Xiang Zhang2 shared
Yuyin Zhou2 shared
Zhan Shi2 shared
Zijun Wang2 shared
Dakuo Wang1 shared

Recent mentions (1)

papers-typed
lin-2026-harness.md