finding
active
finding:sonnet-4-5-trueskill-26-4-4-9-n-14-35-7-win-rate
Sonnet 4.5 TrueSkill μ=26.4 ± 4.9 (n=14, 35.7% win rate)
Mid-field performance, with larger uncertainty (± 4.9) due to the small sample (n=14).
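As a rough illustration of why 14 games leave the win rate loosely pinned down, the sketch below computes a Wilson score interval for it. This is not the TrueSkill calculation behind μ ± 4.9. The win count of 5 is an assumption inferred from 35.7% of 14, and `wilson_interval` is a hypothetical helper, not something taken from the source paper.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 gives an approximately 95% interval.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half


# Assumption: 35.7% of n=14 corresponds to 5 wins (5/14 ≈ 0.357).
wins, games = 5, 14
low, high = wilson_interval(wins, games)
print(f"win rate {wins / games:.1%}, approx. 95% interval [{low:.1%}, {high:.1%}]")
```

The interval spans several tens of percentage points at this sample size, which is consistent with reading the rating as provisional.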
Source paper
extracted_from · (2026) · Robert Müller · Clemens Müller
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Sonnet's win rate in exploratory games
- Best code agent outperforming six of seven LLMs.
- Second-best LLM, competitive with TrackerAgent.
- G3-F TrueSkill μ=30.1 ± 3.3, 72.9% wins, median score 5,250 on combined-comp1 slice (n=98 canonical games) · finding · 0.802 · Top LLM performance with high win rate and large score.
- DeepSeek v3.2 TrueSkill rating
- Quantifies harness activation failure for weak-tier models vs. strong-tier models
- GPT-5.4 Nano TrueSkill rating
- Full evolver-side SWE results showing comparable performance across Claude family tiers