GPT-OSS-120B achieves a skill-load rate of 0.446 on SkillsBench

A mid-tier model whose activation rate falls between those of the weak and strong tiers

Source paper

extracted_from

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
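The selection rule described above (similarity at or above the threshold, no typed relation) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names, the dictionary of embeddings, and the edge representation as unordered pairs are all assumptions made for the example. Only the 0.65 threshold comes from the heading above.

```python
import math

COSINE_THRESHOLD = 0.65  # taken from the "cosine ≥ 0.65" heading


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=COSINE_THRESHOLD):
    """Return (entity_id, similarity) pairs for entities that are close to
    `target` in embedding space but share no typed edge with it.

    Hypothetical inputs:
      embeddings  -- dict mapping entity id to an embedding vector
      typed_edges -- iterable of (entity_a, entity_b) pairs, direction ignored
    """
    # Entities already connected to the target by some typed relation.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target or entity_id in linked:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((entity_id, sim))

    # Most similar first, matching the descending scores in the list below.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Each returned pair would then be reviewed by hand, either to add a typed edge or to merge the pair as duplicates.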

GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.810
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.798
Mid-tier model shows moderate adherence drift compared to weak and strong tiers
GPT-OSS 120B · concept · 0.784
One of two large reasoning models analyzed in the paper for performative vs genuine CoT behavior
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961 · finding · 0.770
Quantifies harness activation failure for weak-tier models vs. strong-tier models
Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.738
Quantifies harness adherence failure gap between strong and weak tier models
Qwen3-235B achieves only 1.1 pp harness-benefit on SkillsBench despite 4.7% base pass rate, near Qwen3-32B's 0.0% baseline · finding · 0.736
Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
SFR-DR-20B achieves 28.7% on Humanity's Last Exam full text-only benchmark, 65% relative improvement over gpt-oss-20b baseline. · finding · 0.734
Main evaluation result showing best variant outperforms many proprietary and open-source baselines of comparable or larger sizes.
GPT5.4-N overbid rate=0.47% · finding · 0.731
overbid rate for GPT-5.4 Nano