finding
active
finding:gpt-oss-120b-adherence-drops-from-0-67-after-harness-loading-to-0-43-at-final-validation-drift-of-0-24
GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24)
Mid-tier model shows moderate adherence drift compared to weak and strong tiers
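The drift figure in the title is the final-validation adherence minus the adherence measured right after harness loading. A minimal sketch of that arithmetic follows; the function name `adherence_drift` and the two-decimal rounding are assumptions for illustration, not taken from the source paper.

```python
def adherence_drift(after_loading: float, final_validation: float) -> float:
    """Return the change in adherence between two checkpoints.

    Hypothetical helper: negative values mean adherence fell over the trajectory.
    Rounded to two decimals to match the precision reported on this page.
    """
    return round(final_validation - after_loading, 2)


# Values quoted on this page.
gpt_oss_drift = adherence_drift(0.67, 0.43)  # GPT-OSS-120B
qwen_drift = adherence_drift(0.52, 0.13)     # Qwen3-32B (related finding)

print(gpt_oss_drift, qwen_drift)
```

On these numbers the GPT-OSS-120B drift is smaller in magnitude than the Qwen3-32B drift, which is consistent with the "moderate" label above.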
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39) · finding · 0.855 · Demonstrates long-horizon instruction-following bottleneck for weak-tier models
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.840 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
- Strong-tier model maintains harness adherence over long-horizon trajectories
- Mid-tier model showing intermediate activation rate between weak and strong tiers
- Replication of non-monotonic harness-benefit pattern on a second benchmark
- H6: Proprietary post-training resists prompt override; GPT-5.4 shows more resistance than GPT-OSS. · hypothesis · 0.758 · Exploratory hypothesis supported by GPT-5.4 vs GPT-OSS comparison
- One of two large reasoning models analyzed in the paper for performative vs genuine CoT behavior
- overbid rate for GPT-5.4 Nano