finding
active
Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39)
Demonstrates long-horizon instruction-following bottleneck for weak-tier models
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Claims (1)
claim
- Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.855 · Mid-tier model shows moderate adherence drift compared to weak and strong tiers
- Strong-tier model maintains harness adherence over long-horizon trajectories
- Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.766 · Illustrates benchmark-dependent reshuffling of evolver rankings, no evolver dominates across all substrates
- Diagnosis of first failure mode explaining low harness-benefit for weak-tier models
- Smaller models show non-monotonic and diminished ASR with increasing cone dimensionality
- Case demonstrating that model scale does not predict harness-updating quality
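
The drift figures quoted for Qwen3-32B and GPT-OSS-120B are consistent with a simple difference: adherence at final validation minus adherence after harness loading. A minimal sketch of that arithmetic follows. The function name `adherence_drift` and the `scores` dictionary are hypothetical, introduced here for illustration only, and the rounding to two decimals is an assumption that matches the precision of the reported values.

```python
def adherence_drift(after_loading: float, final_validation: float) -> float:
    """Return final-validation adherence minus post-loading adherence.

    A negative value means adherence fell over the trajectory.
    Rounded to two decimals to match the reported precision (assumption).
    """
    return round(final_validation - after_loading, 2)


# (adherence after harness loading, adherence at final validation),
# taken from the two findings listed on this page.
scores = {
    "Qwen3-32B": (0.52, 0.13),
    "GPT-OSS-120B": (0.67, 0.43),
}

for model, (after, final) in scores.items():
    print(f"{model}: drift {adherence_drift(after, final):+.2f}")
```

Under this reading, the weak-tier model loses 0.39 of adherence while the mid-tier model loses 0.24, which is the gap the two findings describe.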