finding

active

finding:opus-4-6-adherence-remains-stable-from-0-89-after-harness-loading-to-0-80-at-final-validation-drift-of-0-09

Opus 4.6 adherence remains stable, moving from 0.89 after harness loading to 0.80 at final validation (drift of -0.09)

Strong-tier model maintains harness adherence over long-horizon trajectories

Source paper

extracted_from

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

Strong-tier models benefit less from harness evolution because they already solve many tasks under the initial harness, leaving less room for improvement (ceiling effect)
supports
Explanation offered for why high-base-capability models show lower Δbenefit

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.833
Mid-tier model shows moderate adherence drift compared to weak and strong tiers
Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39) · finding · 0.832
Demonstrates long-horizon instruction-following bottleneck for weak-tier models
Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.763
Quantifies harness adherence failure gap between strong and weak tier models
Claude 3 Opus shows 15% compliance gap (51% supports welfare in training vs 66% when unmonitored) in prompted animal welfare setting · finding · 0.762
Core evidence that model withholds pro-animal-welfare responses during training
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.756
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Opus 4.6 ignored incorrect tool output and reported the precomputed correct answer instead, demonstrating unverbalized reasoning · finding · 0.748
Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
Opus 4.6 represented target language internally before switching languages, with persistent Russian representations appearing before plausible textual cues · finding · 0.747
NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp · finding · 0.746
Full evolver-side SWE results showing comparable performance across Claude family tiers
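
The drift figures quoted for Opus 4.6, GPT-OSS-120B, and Qwen3-32B appear to be the final-validation adherence minus the adherence measured after harness loading. The sketch below reproduces that arithmetic from the numbers listed on this page. The dictionary layout and the name `adherence_drift` are illustrative assumptions, not something taken from the source paper.

```python
# Adherence scores as listed in the findings on this page:
# (after harness loading, at final validation).
# The data layout and function name are hypothetical, for illustration only.
ADHERENCE = {
    "Opus 4.6": (0.89, 0.80),
    "GPT-OSS-120B": (0.67, 0.43),
    "Qwen3-32B": (0.52, 0.13),
}


def adherence_drift(after_loading: float, final_validation: float) -> float:
    """Return final-validation adherence minus post-loading adherence.

    Rounded to two decimals to match the precision of the reported scores.
    """
    return round(final_validation - after_loading, 2)


drifts = {model: adherence_drift(*scores) for model, scores in ADHERENCE.items()}

for model, drift in drifts.items():
    print(f"{model}: {drift:+.2f}")
```

Under this reading, a drift closer to zero means the model holds its harness adherence over the trajectory, which is why the strong-tier result is described as stable relative to the mid and weak tiers.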