finding

active

finding:sonnet-4-5-trueskill-26-4-4-9-n-14-35-7-win-rate

Sonnet 4.5 TrueSkill μ=26.4 ± 4.9 (n=14, 35.7% win rate)

Mid-field performance; the uncertainty (± 4.9) is comparatively large because the sample is small (n=14).
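As a rough illustration of how little 14 games pin down a win rate, the sketch below recovers the win count implied by the reported 35.7% and computes a 95% Wilson score interval for it. It assumes that every game is a plain win or loss and that the win rate is wins divided by games; the source states neither, and the interval is not the TrueSkill uncertainty reported above. The function name `wilson_interval` is illustrative.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half

games = 14
reported_rate = 0.357  # win rate from the finding

# Assumption: the rate is wins / games, so the win count is the nearest integer.
wins = round(reported_rate * games)

low, high = wilson_interval(wins, games)
print(f"{wins}/{games} = {wins / games:.1%}, "
      f"95% Wilson interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans several tens of percentage points, which is consistent with the finding's note that the small sample leaves the estimate loosely constrained.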

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sonnet 4.5 win rate=35.7% (n=14) · finding · 0.875
Sonnet's win rate in exploratory games
TrackerAgent TrueSkill μ=28.7 ± 3.6, 53.6% win rate · finding · 0.836
Best code agent outperforming six of seven LLMs.
G3.1-FL TrueSkill μ=28.0 ± 2.9, 44.9% win rate · finding · 0.836
Second-best LLM, competitive with TrackerAgent.
G3-F TrueSkill μ=30.1 ± 3.3, 72.9% wins, median score 5,250 on combined-comp1 slice (n=98 canonical games) · finding · 0.802
Top LLM performance with high win rate and large score.
DS-v3.2 TrueSkill μ=23.9 ± 2.7 · finding · 0.796
DeepSeek v3.2 TrueSkill rating
Qwen3-32B achieves a skill-load rate (SLR) of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLRs of 0.957–0.961 · finding · 0.790
Quantifies harness activation failure for weak-tier models vs. strong-tier models
GPT5.4-N TrueSkill μ=22.6 ± 2.7 · finding · 0.784
GPT-5.4 Nano TrueSkill rating
On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve a 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp · finding · 0.784
Full evolver-side SWE results showing comparable performance across Claude family tiers