thinker:yisi-sang
Yisi Sang
Authored papers (1)
Harness-updating capability is essentially flat across model capability tiers, while harness-benefit is non-monotonic. This decoupling has direct implications for how capability budgets should be allocated in self-evolving LLM agent systems. Across seven LLMs (including Claude Opus 4.6, Qwen3.5-9B, and GPT-OSS-120B) and three benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench), the gap between the best and worst evolver in harness-updating gain is at most 3.1 percentage points (pp) on any single benchmark, and Qwen3.5-9B produces skills procedurally isomorphic to those of Claude Opus 4.6 on the flink-query task, yielding identical downstream pass rates of 1.0. In contrast, harness-benefit peaks at mid-tier models (e.g., GPT-OSS-120B gains 7.0 pp on MCP-Atlas, Qwen3-235B gains 19.3 pp on SWE-bench Verified), while weak-tier models such as Qwen3-32B gain as little as 4.4 pp on SWE-bench Verified despite having the largest performance headroom. The paper introduces a two-capability decomposition framework that separates harness-updating from harness-benefit, and identifies two failure modes that explain weak-tier underperformance: harness activation failure (a Qwen3-32B skill-load rate of 25.1% versus ~96% for strong-tier models) and harness adherence failure (the Qwen3-32B adherence score drifts from 0.52 at harness load to 0.13 at final validation, a decay four times steeper than that of Opus 4.6). These findings imply that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should explicitly target harness invocation and long-horizon instruction following as first-class skills.
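To make the two weak-tier failure modes concrete, the following minimal sketch works through the arithmetic on the figures quoted in the summary above. The constant and function names are illustrative assumptions, not identifiers from the paper, and the strong-tier skill-load rate is only given approximately (~96%), so the resulting ratio is approximate as well.

```python
# Figures quoted in the summary; all names here are illustrative, not the paper's.
QWEN3_32B_ADHERENCE_AT_LOAD = 0.52
QWEN3_32B_ADHERENCE_AT_FINAL = 0.13
QWEN3_32B_SKILL_LOAD_RATE = 0.251
STRONG_TIER_SKILL_LOAD_RATE = 0.96  # "~96%" in the summary, so approximate


def absolute_drop(start: float, end: float) -> float:
    """Adherence lost between harness load and final validation."""
    return start - end


def relative_drop(start: float, end: float) -> float:
    """Fraction of the starting adherence that was lost."""
    return (start - end) / start


adherence_abs = absolute_drop(QWEN3_32B_ADHERENCE_AT_LOAD, QWEN3_32B_ADHERENCE_AT_FINAL)
adherence_rel = relative_drop(QWEN3_32B_ADHERENCE_AT_LOAD, QWEN3_32B_ADHERENCE_AT_FINAL)

# Activation: how often the weak-tier model loads a skill, relative to strong-tier models.
activation_ratio = QWEN3_32B_SKILL_LOAD_RATE / STRONG_TIER_SKILL_LOAD_RATE

print(f"adherence drop: {adherence_abs:.2f} absolute, {adherence_rel:.0%} relative")
print(f"skill-load rate relative to strong tier: {activation_ratio:.2f}")
```

Read this way, Qwen3-32B loses about three quarters of its initial adherence over a run and loads skills roughly a quarter as often as strong-tier models.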
Co-authors (12)
- Benoit Dumoulin (2 shared)
- Bing He (2 shared)
- Cihang Xie (2 shared)
- Hanqing Lu (2 shared)
- Juncheng Wu (2 shared)
- Minhua Lin (2 shared)
- Suhang Wang (2 shared)
- Xiang Zhang (2 shared)
- Yuyin Zhou (2 shared)
- Zhan Shi (2 shared)
- Zijun Wang (2 shared)
- Dakuo Wang (1 shared)
Recent mentions (1)
- papers-typedlin-2026-harness.md