thinker:lanxiang-hu
Lanxiang Hu
First author of lmgame-Bench, cited.
Authored papers (1)
Across 242 games of 50–60 turns each, strategic coherence predicts rank more strongly than any isolated subskill in CATTLE TRADE, a multi-agent benchmark built on a tabletop bluffing-and-auction game. Strategic coherence is operationalized as capital efficiency (η = score / gross outflow), resource discipline, and phase-adaptive bidding. Gemini 3 Flash (G3-F) leads all ten agents with TrueSkill µ = 30.1 ± 3.3, a 72.9% win rate, a capital efficiency of η = 1.77, and an ≈10× bid-aggressiveness ramp from early game (0.26) to late game (2.49). Gemini 2.5 Flash Lite, by contrast, bids at aggressiveness 2.52 throughout yet achieves η = 0.23 and finishes last. In addition to the TrueSkill competitive rating, the benchmark introduces a behavioural analysis suite that logs every bid, trade-challenge (TC) offer, counteroffer, and card selection to profile overbid frequency, self-bidding rate, bluff calibration, and TC bargaining tightness (τ). Two deterministic heuristic code agents, TrackerAgent (µ = 28.7) and SetRaceAgent (µ = 27.3), outperform six and five of the seven tested LLMs respectively, and only G3-F clearly clears both baselines. TrackerAgent does so through perfect card-counting and opponent-state tracking, a capability no LLM replicates despite receiving identical observable information. The paper argues that cost-efficient LLMs fail not at individual subskills but at deploying them jointly and reliably under competitive pressure, and that benchmarks integrating auctions, hidden-offer deception, discrete resource constraints, and long-horizon portfolio management are necessary to expose failure modes invisible to static evaluations.
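The two headline ratios in the summary above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the benchmark's own code: the function names are hypothetical, and the score and outflow figures passed to `capital_efficiency` are made-up illustrative values chosen only to reproduce η = 1.77. The aggressiveness inputs (0.26 and 2.49) are the values reported for Gemini 3 Flash.

```python
def capital_efficiency(score: float, gross_outflow: float) -> float:
    """eta = score / gross outflow, following the definition in the summary."""
    if gross_outflow <= 0:
        raise ValueError("gross_outflow must be positive")
    return score / gross_outflow


def aggressiveness_ramp(early: float, late: float) -> float:
    """Ratio of late-game to early-game bid aggressiveness."""
    if early <= 0:
        raise ValueError("early must be positive")
    return late / early


# Reported early/late aggressiveness for Gemini 3 Flash.
ramp = aggressiveness_ramp(early=0.26, late=2.49)
print(f"ramp = {ramp:.1f}x")  # prints "ramp = 9.6x", consistent with the quoted ≈10×

# Hypothetical score and outflow, chosen only so that eta matches the reported 1.77.
eta = capital_efficiency(score=177.0, gross_outflow=100.0)
print(f"eta = {eta:.2f}")  # prints "eta = 1.77"
```

Read this way, η above 1 means an agent earned more score than the money it paid out, while the last-place agent's η = 0.23 means roughly a quarter of a point per unit spent.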
Co-authors (12)
- Clemens Müller (3 shared)
- Robert Müller (3 shared)
- Aarohi Srivastava (1 shared)
- Alexander Pan (1 shared)
- Anthony Costarelli (1 shared)
- Anton Bakhtin (1 shared)
- Dan Hendrycks (1 shared)
- Jiaxian Guo (1 shared)
- Jingru Jia (1 shared)
- Jinhao Duan (1 shared)
- Jonathan Light (1 shared)
- Kanishk Gandhi (1 shared)
Recent mentions (2)
- papers-typed: muller-2026-cattle.md
- papers-typed: muller-2026-cattle.md