claim
active
claim:the-trueskill-ranking-broadly-aligns-with-chatbot-arena-but-diverges-from-reasoning-mode-aggregating-evaluations

The TrueSkill ranking broadly aligns with Chatbot Arena but diverges from evaluations that aggregate reasoning modes.

Comparison to external leaderboards, showing where the TrueSkill ranking aligns with them and where it diverges.

Source paper

extracted_from
Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
(2026) · Robert Müller · Clemens Müller
