finding

active

finding:pairing-weakest-anchor-agent-with-best-evolver-against-strongest-anchor-with-worst-evolver-the-strong-agent-still-leads-by-18-6-to-35-2-pp-on-every-benchmark

Even when the weakest anchor agent is paired with the best evolver and the strongest anchor agent with the worst evolver, the strong agent still leads by 18.6 to 35.2 pp on every benchmark

Confirms that the post-evolution performance bottleneck lies on the agent side, not the evolver side
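The cross-pairing comparison described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `cross_pair_gaps`, the nested-dictionary layout, and all numbers in `toy_scores` are hypothetical placeholders. It also assumes that "best" and "worst" evolver mean the highest and lowest post-evolution score for a given agent on a given benchmark, which may differ from how the paper selects them.

```python
from typing import Dict

# Assumed layout: scores[benchmark][agent][evolver] -> post-evolution score (percent).
Scores = Dict[str, Dict[str, Dict[str, float]]]


def cross_pair_gaps(scores: Scores, weak_agent: str, strong_agent: str) -> Dict[str, float]:
    """Per-benchmark lead of the strong agent under the least favorable pairing.

    The weak agent gets its best evolver (max score) and the strong agent gets
    its worst evolver (min score). A positive gap means the strong agent still
    leads on that benchmark.
    """
    gaps = {}
    for benchmark, by_agent in scores.items():
        weak_best = max(by_agent[weak_agent].values())
        strong_worst = min(by_agent[strong_agent].values())
        gaps[benchmark] = strong_worst - weak_best
    return gaps


# Hypothetical placeholder numbers, NOT results from the paper.
toy_scores: Scores = {
    "bench_a": {
        "weak": {"evolver_1": 30.0, "evolver_2": 34.0},
        "strong": {"evolver_1": 60.0, "evolver_2": 55.0},
    },
    "bench_b": {
        "weak": {"evolver_1": 20.0, "evolver_2": 22.0},
        "strong": {"evolver_1": 50.0, "evolver_2": 48.0},
    },
}

gaps = cross_pair_gaps(toy_scores, weak_agent="weak", strong_agent="strong")
print(gaps)
```

If every value in `gaps` is positive, no choice of evolver closes the difference between the two agents on any benchmark, which is the pattern this finding reports.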

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Post-evolution performance is dominated by the task-solving agent's base capability, not by evolver identity
supports
Practical implication of Observation 2 in evolver-side analysis

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.773
Illustrates benchmark-dependent reshuffling of evolver rankings: no evolver dominates across all substrates
Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities · finding · 0.771
Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
Capability budget should be allocated to the task-solving agent rather than the evolver, since harness-updating varies by at most 3.1 pp across evolvers · claim · 0.761
Primary design recommendation derived from harness-updating flatness finding
Qwen3.5-9B evolver achieves highest harness-updating gain on SkillsBench (3.8 pp), exceeding Claude Opus 4.6 (2.3 pp) and Qwen3-235B (1.5 pp) · finding · 0.746
Case demonstrating that model scale does not predict harness-updating quality
Dense but off-task anchors yield high ρd AND high dr; behavior does not improve, consistent with mismatch dominating S · finding · 0.744
E3 negative control validating that both ρd AND dr must be favorable for S to exceed Sc
Math and code tasks show strongest mid-layer anchoring on LLaMA (S ≈ −1.65 at layers 8-12) · finding · 0.744
Task-specific E3 finding showing compositional reasoning requires deeper processing
Peak anchoring Sbmax and normalized area AUSN correlate with per-item success and internal shot midpoints θ50, providing a geometry-to-behavior bridge. · claim · 0.740
Main interpretation of E3.
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.737
Demonstrates that activation similarity can diverge from logit weight similarity due to interference