finding
active
finding:opus-4-6-achieves-hfr-of-0-757-while-qwen3-32b-achieves-hfr-of-only-0-142-on-skillsbench

Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench

Quantifies harness adherence failure gap between strong and weak tier models

Source paper

extracted_from
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
(2026) · Minhua Lin · Juncheng Wu · Zijun Wang · Zhan Shi +13

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.