finding
active
GPT-OSS-120B achieves a skill-load rate of 0.446 on SkillsBench
Mid-tier model showing intermediate activation rate between weak and strong tiers
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- Diagnosis of first failure mode explaining low harness-benefit for weak-tier models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.810 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
- GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.798 · Mid-tier model shows moderate adherence drift compared to weak and strong tiers
- One of two large reasoning models analyzed in the paper for performative vs genuine CoT behavior
- Quantifies harness activation failure for weak-tier models vs. strong-tier models
- Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.738 · Quantifies harness adherence failure gap between strong and weak tier models
- Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
- Main evaluation result showing best variant outperforms many proprietary and open-source baselines of comparable or larger sizes.
- overbid rate for GPT-5.4 Nano