finding
active
Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities
Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
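The comparison behind this finding can be sketched as a small calculation. The sketch assumes that "spread" means the maximum minus the minimum post-evolution score for one agent across evolvers, which this page does not state explicitly. The helper names `spread_pp` and `spread_to_gap_ratio` are hypothetical. Only the two figures quoted in the title (5.1 pp and 36.0 pp) are used as inputs.

```python
def spread_pp(scores_pp):
    """Max minus min of a collection of scores, in percentage points.

    Assumption: this is how "within-agent spread" is measured; the
    source page does not spell out the definition.
    """
    scores = list(scores_pp)
    if not scores:
        raise ValueError("need at least one score")
    return max(scores) - min(scores)


def spread_to_gap_ratio(within_spread_pp, between_gap_pp):
    """Fraction of the between-agent gap covered by the within-agent spread."""
    if between_gap_pp <= 0:
        raise ValueError("between-agent gap must be positive")
    return within_spread_pp / between_gap_pp


# Figures quoted in the finding: largest within-agent spread (Qwen3-235B on MCP)
# and the base-capability gap between Opus and Qwen3-235B.
ratio = spread_to_gap_ratio(5.1, 36.0)
print(f"{ratio:.3f}")  # 0.142
```

On these figures, the largest evolver-induced spread is roughly one seventh of the base-capability gap, which is the sense in which agent capability dominates evolver identity.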
Source paper
extracted_from(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- Practical implication of Observation 2 in evolver-side analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.789 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
- Case demonstrating that model scale does not predict harness-updating quality
- Confirms that post-evolution performance bottleneck is on the agent side, not evolver side
- Primary design recommendation derived from harness-updating flatness finding
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.742 · Quantifies harness adherence failure gap between strong and weak tier models
- Harness-updating gain spread is at most 3.1 percentage points across all evolvers on any single benchmark · finding · 0.737 · Core finding that harness-updating capability does not scale with model base capability