finding
active
Pairing weakest anchor agent with best evolver against strongest anchor with worst evolver, the strong agent still leads by 18.6 to 35.2 pp on every benchmark
Confirms that the post-evolution performance bottleneck is on the agent side, not the evolver side
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- Practical implication of Observation 2 in evolver-side analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.773 · Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
- Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity
- Primary design recommendation derived from harness-updating flatness finding
- Case demonstrating that model scale does not predict harness-updating quality
- E3 negative control validating that both ρd AND dr must be favorable for S to exceed Sc
- Math and code tasks show strongest mid-layer anchoring on LLaMA (S ≈ −1.65 at layers 8-12) · finding · 0.744 · Task-specific E3 finding showing compositional reasoning requires deeper processing
- Main interpretation of E3.
- Demonstrates that activation similarity can diverge from logit weight similarity due to interference