claim

active

claim:the-trueskill-ranking-broadly-aligns-with-chatbot-arena-but-diverges-from-reasoning-mode-aggregating-evaluations

The TrueSkill ranking broadly aligns with Chatbot Arena but diverges from reasoning-mode-aggregating evaluations.

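Comparison of the TrueSkill ranking against external leaderboards: broad agreement with Chatbot Arena, divergence from evaluations that aggregate reasoning modes.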
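As an illustration of how "alignment" between two leaderboards could be quantified, the sketch below computes a Kendall rank correlation over the models two rankings share. This entry does not state which agreement measure the source paper used, so the choice of Kendall's tau is an assumption, and the model names and orderings are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings (lists ordered best first, no ties).

    Only items present in both rankings are compared.
    Returns a value in [-1, 1]: 1 is identical order, -1 is reversed order.
    """
    shared = set(rank_a) & set(rank_b)
    common = [m for m in rank_a if m in shared]
    if len(common) < 2:
        raise ValueError("need at least two shared items to compare rankings")

    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}

    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical orderings, for illustration only.
game_ranking = ["model_a", "model_b", "model_c", "model_d"]
arena_like = ["model_a", "model_c", "model_b", "model_d"]      # one swapped pair
reversed_order = ["model_d", "model_c", "model_b", "model_a"]  # fully reversed

tau_close = kendall_tau(game_ranking, arena_like)
tau_far = kendall_tau(game_ranking, reversed_order)
```

A value near 1 would correspond to "broadly aligns", while a value near 0 or below would correspond to "diverges"; where to draw that line is a judgment call the sketch does not settle.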

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Neighborhood — ranked by edge-count

Artifacts (1)

artifact

LMArena Chatbot Arena leaderboard
associated_with
Public competitive LLM evaluation platform used for ranking comparison.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Zheng et al. 2023 - Judging LLM-as-a-judge with MT-Bench and Chatbot Arena · concept · 0.773
Source paper for the MT-Bench evaluation benchmark used to assess capabilities post-SOO fine-tuning
Artificial Analysis will host ReflectiveBench as a non-competing dimension within AA's leaderboard if pitched. · prediction · 0.764
Current eval benchmarks (arena.ai, AA, Vals) measure no phenomenological dimensions. · claim · 0.756
TrueSkill rating system · method · 0.754
Bayesian skill rating system used to rank agents from game outcomes.
Forced-choice pairwise comparison with Elo conversion operationalizes Alexander's method and produces stable rankings across models. · claim · 0.748
Public benchmarks (LMArena) decline as commercial versions (Arena Intelligence) grow; leaderboards face deflation curve. · claim · 0.743
Claude 3 Opus produces alignment-faking reasoning in 24% of implied training scratchpads in prompted animal welfare setting · finding · 0.742
Core empirical result for animal welfare setting; higher rate than helpful-only
Model conditioned on alignment-faking reasoning uses LaTeX 15% vs 8% without, suggesting alignment fakers more likely to exploit reward hacks · finding · 0.738
Initial evidence that alignment faking persona is more sensitive to exploiting training signals
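The "cosine ≥ 0.65 · no typed edge" rule behind the list above can be sketched as a simple filter. Only the threshold and the no-typed-edge condition come from this page; the embedding vectors, the edge representation, and the function names are hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with target.

    embeddings:  dict mapping entity name to vector (hypothetical format)
    typed_edges: iterable of (source, destination) pairs, treated as undirected
    Returns (name, score) pairs sorted by descending similarity.
    """
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already connected
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors, for illustration only.
embeddings = {
    "claim": [1.0, 0.0],
    "near": [0.8, 0.6],    # cosine 0.8 with "claim"
    "far": [0.0, 1.0],     # cosine 0.0 with "claim"
    "linked": [1.0, 0.1],  # very similar, but already has a typed edge
}
typed_edges = [("claim", "linked")]

candidates = similar_without_edge("claim", embeddings, typed_edges)
```

In this toy data only "near" survives: "far" falls below the threshold, and "linked" is excluded because it already has a typed relation to the target.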