
Juncheng Wu

Authored: 1
Introduces: 0
Studies: 0
Affiliations: 1
Cited by: 0

Authored papers (1)

  • Harness-updating capability is essentially flat across model capability tiers, while harness-benefit is non-monotonic. This decoupling has direct implications for how capability budgets should be allocated in self-evolving LLM agent systems. Across seven LLMs (including Claude Opus 4.6, Qwen3.5-9B, and GPT-OSS-120B) and three benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench), the gap between the best and worst evolver in harness-updating gain is at most 3.1 percentage points (pp) on any single benchmark. On the flink-query task, Qwen3.5-9B produces skills procedurally isomorphic to those of Claude Opus 4.6, and both yield a downstream pass rate of 1.0. In contrast, harness-benefit peaks at mid-tier models (e.g., GPT-OSS-120B gains 7.0 pp on MCP-Atlas and Qwen3-235B gains 19.3 pp on SWE-bench Verified), while weak-tier models such as Qwen3-32B gain as little as 4.4 pp on SWE-bench Verified despite having the largest performance headroom. The paper introduces a two-capability decomposition framework that separates harness-updating from harness-benefit, and it identifies two failure modes that explain weak-tier underperformance: harness activation failure (Qwen3-32B has a skill-load rate of 25.1%, versus ~96% for strong-tier models) and harness adherence failure (Qwen3-32B's adherence score drifts from 0.52 at harness load to 0.13 at final validation, a decay four times steeper than that of Opus 4.6). These findings imply that capability investment should flow to the task-solving agent rather than the evolver, and that agent training should explicitly target harness invocation and long-horizon instruction following as first-class skills.
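
The arithmetic behind two of the figures in the summary above can be checked with a minimal sketch. It uses only numbers quoted in the summary; the function names `adherence_decay` and `spread_pp` are hypothetical, and reading "four times steeper" as a ratio of absolute score drops is an assumption, not something the summary states.

```python
def adherence_decay(at_load: float, at_final: float) -> float:
    """Absolute drop in adherence score from harness load to final validation."""
    return at_load - at_final


def spread_pp(gains_pp: dict[str, float]) -> float:
    """Gap, in percentage points, between the largest and smallest gain."""
    return max(gains_pp.values()) - min(gains_pp.values())


# Qwen3-32B adherence scores quoted in the summary.
qwen_decay = adherence_decay(at_load=0.52, at_final=0.13)

# ASSUMPTION: "four times steeper" is read as a 4x ratio of absolute drops.
# The result is an inferred value, not a figure reported in the summary.
implied_opus_decay = qwen_decay / 4

# Harness-benefit gains on SWE-bench Verified quoted in the summary
# (only the two models named there; the other models are not listed).
swe_benefit_pp = {"Qwen3-235B": 19.3, "Qwen3-32B": 4.4}
benefit_spread = spread_pp(swe_benefit_pp)

# Upper bound on the harness-updating gap quoted in the summary.
updating_gap_max_pp = 3.1

print(f"Qwen3-32B adherence drop: {qwen_decay:.2f}")
print(f"Implied Opus 4.6 drop (assumed 4x ratio): {implied_opus_decay:.4f}")
print(f"Harness-benefit spread on SWE-bench Verified: {benefit_spread:.1f} pp")
print(f"Harness-updating gap (upper bound): {updating_gap_max_pp:.1f} pp")
```

Under these assumptions, the harness-benefit spread between the two named models is several times larger than the 3.1 pp ceiling on the harness-updating gap, which is the decoupling the summary describes.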


Affiliations (1)

Co-authors (12)

Recent mentions (1)