finding
active
finding:opus-4-6-adherence-remains-stable-from-0-89-after-harness-loading-to-0-80-at-final-validation-drift-of-0-09
Opus 4.6 adherence remains stable from 0.89 after harness loading to 0.80 at final validation (drift of -0.09)
Strong-tier model maintains harness adherence over long-horizon trajectories
Source paper
extracted_from · (2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Explanation offered for why high-base-capability models show lower Δbenefit
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.833 · Mid-tier model shows moderate adherence drift compared to weak and strong tiers
- Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39) · finding · 0.832 · Demonstrates long-horizon instruction-following bottleneck for weak-tier models
- Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.763 · Quantifies harness adherence failure gap between strong and weak tier models
- Core evidence that model withholds pro-animal-welfare responses during training
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
- NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
- Full evolver-side SWE results showing comparable performance across Claude family tiers
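
The drift figures quoted in this finding and its two sibling findings can be reproduced with simple arithmetic. The sketch below assumes drift is defined as adherence at final validation minus adherence after harness loading, which matches the reported values; the function name `adherence_drift` and the dictionary layout are illustrative and not taken from the source paper.

```python
# Adherence scores as reported in the findings above:
# (after harness loading, at final validation).
ADHERENCE = {
    "Opus 4.6": (0.89, 0.80),
    "GPT-OSS-120B": (0.67, 0.43),
    "Qwen3-32B": (0.52, 0.13),
}


def adherence_drift(after_loading: float, final_validation: float) -> float:
    """Assumed definition: final adherence minus initial adherence (2 decimals)."""
    return round(final_validation - after_loading, 2)


# Hypothetical helper output: one drift value per model.
drifts = {model: adherence_drift(*scores) for model, scores in ADHERENCE.items()}

# List models from smallest to largest loss of adherence.
for model, drift in sorted(drifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {drift:+.2f}")
```

Under this definition the strong-tier model loses the least adherence over the trajectory and the weak-tier model the most, which is the ordering the three findings describe.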