claim
active
claim:the-trueskill-ranking-broadly-aligns-with-chatbot-arena-but-diverges-from-reasoning-mode-aggregating-evaluations
The TrueSkill ranking broadly aligns with Chatbot Arena but diverges from reasoning-mode-aggregating evaluations.
Comparison to external leaderboards, showing where the TrueSkill ranking aligns and where it diverges.
Source paper
extracted_from · (2026) · Robert Müller · Clemens Müller
Neighborhood — ranked by edge-count
Artifacts (1)
artifact
- LMArena Chatbot Arena leaderboard · associated_with · Public competitive LLM evaluation platform used for ranking comparison.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Source paper for the MT-Bench evaluation benchmark used to assess capabilities post-SOO fine-tuning
- Bayesian skill rating system used to rank agents from game outcomes.
- Core empirical result for animal welfare setting; higher rate than helpful-only
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals