finding

active

finding:within-agent-spread-across-seven-evolvers-is-at-most-5-1-pp-qwen3-235b-on-mcp-small-against-the-36-0-pp-gap-between-opus-and-qwen3-235b-base-capabilities

Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities

Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
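The comparison behind this finding can be checked with a few lines of arithmetic. The sketch below assumes "within-agent spread" means the maximum minus the minimum post-evolution score for one fixed agent across its evolvers; that definition and the function names are illustrative assumptions, not taken from the source paper. Only the two reported figures (5.1 pp and 36.0 pp) come from the finding itself.

```python
def within_agent_spread(scores_by_evolver):
    """Max minus min post-evolution score (in pp) for one fixed agent.

    Assumed definition of "spread"; the source does not spell it out here.
    """
    values = list(scores_by_evolver.values())
    return max(values) - min(values)


def spread_to_gap_ratio(spread_pp, gap_pp):
    """How large the evolver-induced spread is relative to the base-capability gap."""
    return spread_pp / gap_pp


# Figures reported in the finding (percentage points).
REPORTED_MAX_SPREAD_PP = 5.1   # largest within-agent spread (Qwen3-235B on MCP)
REPORTED_BASE_GAP_PP = 36.0    # Opus vs. Qwen3-235B base-capability gap

ratio = spread_to_gap_ratio(REPORTED_MAX_SPREAD_PP, REPORTED_BASE_GAP_PP)
print(f"spread / gap = {ratio:.3f}")
```

On the reported numbers, the largest evolver-induced spread is roughly 14% of the base-capability gap, which is the sense in which evolver identity is a second-order effect here.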

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Post-evolution performance is dominated by the task-solving agent's base capability, not by evolver identity
supports
Practical implication of Observation 2 in evolver-side analysis

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.789
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.789
Illustrates benchmark-dependent reshuffling of evolver rankings, no evolver dominates across all substrates
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.781
Case demonstrating that model scale does not predict harness-updating quality
When the weakest anchor agent paired with the best evolver is compared against the strongest anchor agent paired with the worst evolver, the strong agent still leads by 18.6 to 35.2 pp on every benchmark · finding · 0.771
Confirms that post-evolution performance bottleneck is on the agent side, not evolver side
Capability budget should be allocated to the task-solving agent rather than the evolver, since harness-updating gain varies by at most 3.1 pp across evolvers · claim · 0.747
Primary design recommendation derived from harness-updating flatness finding
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.742
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.742
Quantifies harness adherence failure gap between strong and weak tier models
Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark · finding · 0.737
Core finding that harness-updating capability does not scale with model base capability