finding

active

finding:gpt-oss-120b-adherence-drops-from-0-67-after-harness-loading-to-0-43-at-final-validation-drift-of-0-24

GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24)

Mid-tier model shows moderate adherence drift, between the weak tier (Qwen3-32B, -0.39) and the strong tier (Opus 4.6, -0.09)
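The drift figure is simply the final-validation adherence minus the adherence measured after harness loading. A minimal sketch of that arithmetic, using only the scores listed on this page; the dictionary layout and the function name `adherence_drift` are illustrative assumptions, not part of the source paper:

```python
# Adherence scores (after harness loading, at final validation) as listed on this page.
# The structure and names below are hypothetical, chosen only for illustration.
ADHERENCE = {
    "Qwen3-32B": (0.52, 0.13),     # weak tier
    "GPT-OSS-120B": (0.67, 0.43),  # mid tier
    "Opus 4.6": (0.89, 0.80),      # strong tier
}


def adherence_drift(after_loading: float, final_validation: float) -> float:
    """Return final-validation adherence minus post-loading adherence.

    Rounded to two decimals to match the precision of the reported scores.
    """
    return round(final_validation - after_loading, 2)


# Drift per model; a more negative value means more adherence was lost.
drifts = {model: adherence_drift(*scores) for model, scores in ADHERENCE.items()}

# Models ordered from largest loss to smallest loss.
ranked = sorted(drifts, key=drifts.get)
```

Under these assumptions, `drifts` reproduces the three drift values quoted on the page, and `ranked` places GPT-OSS-120B between the other two models.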

Source paper

extracted_from

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39) · finding · 0.855
Demonstrates long-horizon instruction-following bottleneck for weak-tier models
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.840
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
Opus 4.6 adherence remains stable from 0.89 after harness loading to 0.80 at final validation (drift of -0.09) · finding · 0.833
Strong-tier model maintains harness adherence over long-horizon trajectories
GPT-OSS-120B achieves a skill-load rate of 0.446 on SkillsBench · finding · 0.798
Mid-tier model showing intermediate activation rate between weak and strong tiers
On MCP-Atlas, harness-benefit peaks at GPT-OSS-120B (7.0 pp), with lower gains at both ends of the base-capability scale · finding · 0.791
Replication of non-monotonic harness-benefit pattern on a second benchmark
H6: Proprietary post-training resists prompt override; GPT-5.4 shows more resistance than GPT-OSS · hypothesis · 0.758
Exploratory hypothesis supported by GPT-5.4 vs GPT-OSS comparison
GPT-OSS 120B · concept · 0.746
One of two large reasoning models analyzed in the paper for performative vs genuine CoT behavior
GPT5.4-N overbid rate=0.47% · finding · 0.743
Overbid rate for GPT-5.4 Nano
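The neighbor list above is described as entities with cosine similarity of at least 0.65 and no typed edge to this finding. A rough sketch of how such a filter might look; the `Neighbor` type, the function `untyped_neighbors`, and the two entries marked hypothetical are assumptions made for illustration, and the page does not say how the list is actually produced:

```python
from typing import Iterable, List, NamedTuple, Set


class Neighbor(NamedTuple):
    title: str
    kind: str      # e.g. "finding", "concept", "hypothesis"
    cosine: float  # cosine similarity to the current entity


COSINE_THRESHOLD = 0.65  # threshold shown on the page


def untyped_neighbors(
    candidates: Iterable[Neighbor],
    typed_titles: Set[str],
    threshold: float = COSINE_THRESHOLD,
) -> List[Neighbor]:
    """Keep candidates at or above the threshold that have no typed edge,
    sorted by similarity, highest first."""
    kept = [
        c for c in candidates
        if c.cosine >= threshold and c.title not in typed_titles
    ]
    return sorted(kept, key=lambda c: c.cosine, reverse=True)


candidates = [
    Neighbor("GPT5.4-N overbid rate=0.47%", "finding", 0.743),
    Neighbor("GPT-OSS 120B", "concept", 0.746),
    # The next two entries are hypothetical, added only to exercise the filter.
    Neighbor("hypothetical low-similarity entity", "concept", 0.60),
    Neighbor("hypothetical already-linked entity", "finding", 0.90),
]
typed = {"hypothetical already-linked entity"}

result = untyped_neighbors(candidates, typed)
```

Here `result` keeps only the two entries taken from the list above, dropping the one below the threshold and the one that already has a typed edge.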