finding

active

finding:on-mcp-atlas-harness-benefit-peaks-at-gpt-oss-120b-7-0-pp-with-lower-gains-at-both-ends-of-the-base-capability-scale

On MCP-Atlas, harness-benefit peaks at GPT-OSS-120B (7.0 pp), with lower gains at both ends of the base-capability scale

Replication of non-monotonic harness-benefit pattern on a second benchmark

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Claims (1)

claim

Harness-benefit is non-monotonic in base capability: weak-tier models benefit little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier
supports
Second major claim of the paper, supported by Δbenefit measurements across six models on three benchmarks
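
A minimal sketch of what "non-monotonic" means here, assuming the simplest reading: with models ordered from weakest to strongest base capability, the largest gain falls at neither end. The helper name `peak_is_interior` is hypothetical and this is not the paper's own test. The example values are the SWE-bench harness-benefit figures listed on this page.

```python
from typing import Sequence


def peak_is_interior(gains_pp: Sequence[float]) -> bool:
    """Return True if the largest gain lies strictly inside the sequence.

    `gains_pp` must be ordered from weakest to strongest base model.
    This is an illustrative check only, not the paper's methodology.
    """
    if len(gains_pp) < 3:
        raise ValueError("need at least three models to have an interior point")
    # Index of the (first) maximum gain.
    peak = max(range(len(gains_pp)), key=gains_pp.__getitem__)
    return 0 < peak < len(gains_pp) - 1


# SWE-bench harness-benefit in pp, weakest to strongest base model:
# Qwen3-32B, Qwen3-235B, Opus 4.6 (values as listed on this page).
swe_bench_gains = [4.4, 19.3, 2.6]
print(peak_is_interior(swe_bench_gains))
```

A strictly increasing or decreasing sequence fails this check, which is what separates the pattern described in this claim from a simple "stronger models gain more" or "weaker models gain more" trend.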

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.838
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.832
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.791
Mid-tier model shows moderate adherence drift compared to weak and strong tiers
harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6 · quote · 0.772
Verbatim summary of first major finding from conclusion
ATLAS LA-GRPO achieves 51.3% on BLINK average, improving from baseline 22.8% · finding · 0.762
Discrete functional tokens substantially improve structured visual reasoning on BLINK benchmark, a core validation of ATLAS effectiveness.
Harness-updating capability is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains · claim · 0.759
First major claim of the paper, supported by narrow spread across evolvers and case study
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.758
Illustrates benchmark-dependent reshuffling of evolver rankings, no evolver dominates across all substrates
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.757
Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit