paper:doi-10-48550-arxiv-2605-14537
Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
TL;DR
Across 242 games spanning 50–60 turns each, strategic coherence — operationalized as capital efficiency (η = score/gross outflow), resource discipline, and phase-adaptive bidding — predicts rank more strongly than any isolated subskill in CATTLE TRADE, a multi-agent benchmark built on a tabletop bluffing-and-auction game. Gemini 3 Flash leads all ten agents with TrueSkill µ = 30.1 ± 3.3 and 72.9% win rate, a capital efficiency of η = 1.77, and an ≈10× bid-aggressiveness ramp from early-game (0.26) to late-game (2.49); Gemini 2.5 Flash Lite, by contrast, bids at aggressiveness 2.52 throughout yet achieves η = 0.23 and finishes last. The benchmark introduces a behavioural analysis suite that logs every bid, trade-challenge (TC) offer, counteroffer, and card selection to profile overbid frequency, self-bidding rate, bluff calibration, and TC bargaining tightness (τ), in addition to TrueSkill competitive rating. Two deterministic heuristic code agents — TrackerAgent (µ = 28.7) and SetRaceAgent (µ = 27.3) — outperform six and five of seven tested LLMs respectively, with only G3-F clearly clearing both baselines; TrackerAgent does so through perfect card-counting and opponent-state tracking, a capability no LLM replicates despite receiving identical observable information. The paper argues this implies that cost-efficient LLMs fail not at individual subskills but at their reliable joint deployment under competitive pressure, and that benchmarks requiring the integration of auctions, hidden-offer deception, discrete resource constraints, and long-horizon portfolio management are necessary to expose failure modes invisible to static evaluations.
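As a minimal sketch of the capital-efficiency definition above (η = score / gross outflow), the snippet below computes η for two ledgers. The function name and the ledger numbers are hypothetical; they are chosen only so that the results match the two η values reported above and are not taken from the paper's game logs.

```python
def capital_efficiency(final_score: float, gross_outflow: float) -> float:
    """eta = final score / gross outflow, i.e. points earned per coin spent."""
    if gross_outflow <= 0:
        raise ValueError("gross_outflow must be positive")
    return final_score / gross_outflow


# Hypothetical ledgers (illustrative only, not from the benchmark logs).
disciplined = capital_efficiency(final_score=3540, gross_outflow=2000)
spendthrift = capital_efficiency(final_score=460, gross_outflow=2000)

print(round(disciplined, 2), round(spendthrift, 2))  # 1.77 0.23
```

With identical spending, the two hypothetical agents differ only in how many points each coin bought, which is the quantity η isolates.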
What to take away
- 1. Gemini 3 Flash achieves TrueSkill µ = 30.1 ± 3.3 and a 72.9% win rate across 98 canonical games, making it the only LLM to clearly outperform all three deterministic code-agent baselines.
- 2. TrackerAgent (µ = 28.7) and SetRaceAgent (µ = 27.3) beat six and five of seven LLMs respectively, demonstrating that deterministic card-counting and greedy quartet-pursuit heuristics suffice to outperform most cost-efficient language models in this setting.
- 3. Capital efficiency η (score per coin of gross outflow) separates the field sharply: G3-F achieves η = 1.77 and G3.1-FL achieves 1.46, while Gemini 2.5 Flash Lite scores η = 0.23 despite the highest bid aggressiveness of 2.52, paying an average of 1,193 coins per completed quartet versus 600–750 for top agents.
- 4. G3-F escalates bid aggressiveness ≈10× from early-game (0.26) to late-game (2.49) in lockstep with quartet-completion pressure, a phase-adaptation pattern also present in TrackerAgent (0.06 → 1.92) and SetRaceAgent (0.11 → 1.55) but absent in Gemini 2.5 Flash Lite (2.07 early, 2.08 late).
- 5. Self-bidding rate (raising with no competing bid since the agent's last bid) correlates inversely with performance among LLMs: G3.1-FL self-bids in under 7% of rounds while DeepSeek v3.2, GPT-5.4 Nano, and Gemini 2.5 Flash Lite self-bid in over 74%, with one DeepSeek trace incrementing 10 → 850 over 49 sole-bidder rounds.
- 6. Gemini 2.5 Flash Lite overbids in 1.20% of auctions (the highest rate among all tested agents, versus 0.00% for all three code agents), a mechanical failure that reveals wealth to all opponents, restarts the auction at zero, and directly compounds downstream cash depletion.
- 7. Token verbosity does not predict strategic quality: G3-F emits ≈1,500 completion tokens per call (≈275,000 per game) and ranks first, while Claude Haiku 4.5 emits the most completion tokens of any model yet ranks ninth, and G3.1-FL uses only ≈80 tokens per call (≈14,800 per game) while ranking third.
- 8. The benchmark introduces a behavioural analysis suite logging every bid, TC offer, counteroffer, and card selection, computing overbid frequency, self-bid rate, TC bargaining tightness τ, capital efficiency η, buy-right usage, bluff rate, and phase-dependent bid aggressiveness to diagnose why agents lose rather than just that they lose.
- 9. An open question the paper raises is whether the observed failure modes — overbidding, self-bidding spirals, bankrupt TC initiation, and weak opponent-state adaptation — reflect fundamental reasoning limitations of cost-efficient models or a cost-efficiency tradeoff that would disappear if frontier models were evaluated at full reasoning budgets.
- 10. To replicate the mixed-format calibration design, a researcher would run 168 games with a single LLM facing three deterministic code agents across seven fixed opponent-composition schedules (e.g., TrackerAgent + EconomyAgent + SetRaceAgent for C1, two TrackerAgents + EconomyAgent for C2), using temperature 0.1, reasoning effort low, a 4,096-token response limit, and TrueSkill for competitive rating.
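The phase-adaptation figures in takeaway 4 can be turned into late/early ratios with a few lines of arithmetic. The sketch below uses only the early- and late-game aggressiveness values quoted in that takeaway; the dictionary and function names are illustrative, and the ratio itself is a convenience statistic rather than a metric defined by the paper.

```python
# (early-game, late-game) bid aggressiveness as quoted in the takeaways above.
PHASE_AGGRESSIVENESS = {
    "Gemini 3 Flash": (0.26, 2.49),
    "TrackerAgent": (0.06, 1.92),
    "SetRaceAgent": (0.11, 1.55),
    "Gemini 2.5 Flash Lite": (2.07, 2.08),
}


def ramp(early: float, late: float) -> float:
    """Late-game aggressiveness as a multiple of early-game aggressiveness."""
    return late / early


ramps = {name: ramp(*pair) for name, pair in PHASE_AGGRESSIVENESS.items()}

# Print agents from steepest to flattest ramp.
for name, value in sorted(ramps.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {value:5.1f}x")
```

The ratio is sensitive to small early-game baselines (TrackerAgent's 0.06, for instance), so it is better read as a qualitative indicator of adaptation than as a ranking.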
Peer brief — for seminar discussion
CATTLE TRADE is a multi-agent benchmark built on a four-player tabletop card game (Kuhhandel, Ravensburger 1985) adapted into a Python game engine with a structured LLM agent framework. Seven cost-efficient LLMs — Gemini 3 Flash (G3-F), Gemini 3.1 Flash Lite (G3.1-FL), Claude Sonnet 4.5, Claude Haiku 4.5, DeepSeek v3.2, GPT-5.4 Nano, and Gemini 2.5 Flash Lite — and three deterministic code agents compete in 242 games of 50–60 turns each, evaluated by TrueSkill competitive rating and a behavioural analysis suite introduced here that tracks overbid frequency, self-bidding rate, TC bargaining tightness τ, capital efficiency η, and phase-adaptive bid aggressiveness. An alternative evaluation design could have used Elo ratings on pairwise outcomes rather than TrueSkill's Bayesian multi-player skill estimates, which would have lost information about within-game finish position and made it harder to handle the unbalanced game counts across agent pairs. The load-bearing finding is that strategic coherence — specifically capital efficiency η = score/gross outflow, resource discipline, and bid-timing adaptation — predicts rank more reliably than any isolated subskill across both pure-LLM and mixed-format games. G3-F achieves η = 1.77 with a 72.9% win rate (TrueSkill µ = 30.1 ± 3.3), while TrackerAgent (µ = 28.7, 53.6% wins) and SetRaceAgent (µ = 27.3) outperform six and five of seven LLMs respectively. Gemini 2.5 Flash Lite, the weakest agent, posts η = 0.23 despite the highest bid aggressiveness (2.52), pays on average 1,193 coins per completed quartet versus 600–750 for top agents, and overbids in 1.20% of auctions — a figure that is 0.00% for every code agent. G3-F scales bid intensity ≈10× from early- to late-game (0.26 → 2.49), a rational ramp also present in TrackerAgent (0.06 → 1.92); Gemini 2.5 Flash Lite shows no adaptation (2.07 → 2.08). 
Behavioural traces surface a specific failure mode: DeepSeek v3.2 has a 75.4% self-bidding rate, and one of its traces increments from 10 to 850 coins over 49 uncontested rounds, suggesting the model treats its own previous bid as a competing signal. The paper predicts that the failure modes documented here (overbidding, self-bidding spirals, bankrupt TC initiation, and inadequate opponent-state adaptation) reflect not the absence of individual skills but their unreliable integration under competitive pressure, and that frontier models at full reasoning budgets might close the gap with heuristic baselines. The most pointed critique a careful reader would raise is that all seven LLMs are cost-efficient models running at low reasoning effort with a 4,096-token budget, so the finding that two simple heuristics outperform most of them may say more about model tier and inference budget than about LLM strategic reasoning in general. The paper acknowledges this but does not test even one frontier model (e.g., full Claude Sonnet or GPT-4o) at higher reasoning effort, leaving open whether the hierarchy collapses when capability constraints are relaxed. A secondary concern is that Sonnet 4.5 appears in only 14 games (versus 47–50 for the primary six), producing a TrueSkill σ ≈ 1.6 versus ≈ 1.0 for the others, which makes any comparative claim about its mid-field placement tentative. Finally, the benchmark's single-prompt design, in which every model receives identical neutral game rules plus "play optimally to maximize your expected score", means prompt sensitivity is uncharacterized. Some documented failures (e.g., overbidding) may therefore partly reflect numerical parsing or working-memory load from the natural-language observation format rather than strategic reasoning per se, a confound the code agents avoid by operating on structured data with exact arithmetic.
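The self-bidding pattern described above can be detected mechanically. The following is a minimal sketch, assuming each auction is available as a chronological list of (bidder, amount) pairs; that log format, the function name, and the sample auctions are hypothetical and not the benchmark's actual schema. A bid counts as a self-bid when the immediately preceding bid in the same auction came from the same agent, meaning no competitor has bid since the agent's last bid.

```python
from typing import Iterable, List, Tuple

Bid = Tuple[str, int]  # (bidder name, amount); hypothetical log format


def self_bid_rate(auctions: Iterable[List[Bid]], agent: str) -> float:
    """Fraction of the agent's bids that raise over its own standing bid."""
    total = 0
    self_bids = 0
    for bids in auctions:
        previous_bidder = None  # reset at the start of every auction
        for bidder, _amount in bids:
            if bidder == agent:
                total += 1
                if previous_bidder == agent:
                    self_bids += 1
            previous_bidder = bidder
    return self_bids / total if total else 0.0


# Illustrative auctions, not real traces.
auctions = [
    [("A", 10), ("B", 20), ("A", 30)],             # contested: no self-bids
    [("A", 10), ("A", 20), ("A", 30), ("A", 40)],  # sole bidder: 3 self-bids
]

print(self_bid_rate(auctions, "A"))  # 0.5
print(self_bid_rate(auctions, "B"))  # 0.0
```

A rate computed this way cannot by itself distinguish an error from deliberate escalation; it only flags raises that no competing bid required.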
Methods (17)
- behavioural analysis suite: suite profiling strategic play via spending efficiency, bluff rates, phase-dependent bid adaptation, self-bidding rates, and buy-right patterns
- bid aggressiveness: mean of bid divided by quartet value of auctioned animal
- bluff percentage metric: Fraction of an agent's TC offers consisting entirely of 0-value money cards.
- buy-right percentage metric: Fraction of auctioneer decisions where the agent exercised buy-right.
- canonical auction mode: auction mode with iterative call rounds where all non-auctioneer players submit bids simultaneously, faithful to tabletop rules
- capital efficiency η: ratio of final score to gross outflow, measuring points per coin spent
- cost per quartet metric: Total coins spent by an agent divided by quartets completed, measuring acquisition efficiency.
- EconomyAgent: deterministic code agent that models resource economy, tracking money flows and exploiting cash-poor opponents
- fast auction mode: auction mode with a single sealed bid per player
- legacy auction mode: auction mode with sequential bidding
- overbid frequency metric: Fraction of auctions where the agent bids more than its total money, triggering wealth revelation.
- overbid rate: fraction of auctions in which an agent submitted a bid exceeding its total money, triggering wealth revelation penalty
- self-bid rate: fraction of auction bids placed in rounds with no competing bid since the agent's last bid
- SetRaceAgent: deterministic code agent that greedily pursues quartet completion, bidding aggressively on near-complete sets
- TC-accept rate metric: Fraction of trade challenges resolved by accepting the face-down offer rather than countering.
- TC bargaining tightness τ: size-weighted ratio of (loser offer + 10) to winner offer in counter-exchange wins, where 1 means winner paid only the minimum increment
- TrackerAgent: deterministic code agent that maintains perfect information from observable events and makes greedy decisions conditioned on card counts and estimated wealth
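Several of the metric definitions above can be sketched directly in code. The example below is illustrative only: the record classes, field names, and sample data are hypothetical, each auction is summarized by the agent's highest bid (an assumption), and "size-weighted" in the τ definition is interpreted as weighting each counter-exchange by the winner's offer, which may differ from the paper's exact weighting.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AuctionRecord:
    """One auction from one agent's point of view (hypothetical schema)."""
    highest_bid: int    # largest bid the agent submitted in this auction
    quartet_value: int  # quartet value of the auctioned animal
    money_held: int     # agent's total money at the time of bidding


@dataclass
class CounterWin:
    """A trade challenge settled by counter-exchange (hypothetical schema)."""
    winner_offer: int
    loser_offer: int


# All functions below expect non-empty input sequences.

def bid_aggressiveness(auctions: Sequence[AuctionRecord]) -> float:
    """Mean of bid / quartet value over the agent's auctions."""
    return sum(a.highest_bid / a.quartet_value for a in auctions) / len(auctions)


def overbid_rate(auctions: Sequence[AuctionRecord]) -> float:
    """Fraction of auctions in which the bid exceeded the agent's total money."""
    return sum(a.highest_bid > a.money_held for a in auctions) / len(auctions)


def bluff_percentage(offers: Sequence[List[int]]) -> float:
    """Fraction of TC offers made up entirely of 0-value money cards."""
    bluffs = sum(1 for cards in offers if cards and all(v == 0 for v in cards))
    return bluffs / len(offers)


def tc_tightness(wins: Sequence[CounterWin]) -> float:
    """(loser offer + 10) / winner offer, weighted by winner offer (assumed)."""
    paid = sum(w.winner_offer for w in wins)
    needed = sum(w.loser_offer + 10 for w in wins)
    return needed / paid


# Illustrative data, not taken from the benchmark logs.
auctions = [AuctionRecord(100, 400, 500), AuctionRecord(900, 360, 300)]
offers = [[0, 0], [10, 50], [0], [0, 100]]
wins = [CounterWin(winner_offer=60, loser_offer=50),
        CounterWin(winner_offer=200, loser_offer=30)]

print(bid_aggressiveness(auctions))    # 1.375
print(overbid_rate(auctions))          # 0.5
print(bluff_percentage(offers))        # 0.5
print(round(tc_tightness(wins), 3))    # 0.385
```

Under this weighting, a winner who pays exactly the loser's offer plus 10 scores τ = 1, and larger overpayments pull τ toward 0.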
Frameworks (2)
- CATTLE TRADE: multi-agent benchmark for LLM bluffing, bidding, and bargaining, integrating auctions, hidden-offer trade challenges, and resource management
- CATTLE TRADE benchmark: A multi-agent benchmark integrating auctions, hidden-offer trade challenges, bluffing, bargaining, and resource management over 50–60 turns with four players, evaluating LLMs and code agents.
Datasets (1)
- CATTLE TRADE 242-game dataset: Full dataset of 242 games (228 primary + 14 exploratory) logging every bid, TC offer, counteroffer, and card selection across 7 LLMs and 3 code agents.
Findings (50)
- Gemini 3 Flash completes fourth quartet by paying far above face value, netting ≈1,800 points from multiplicative scoring
A trace shows G3-F turning a nominally wasteful overpay into a net score gain due to the multiplicative formula.
- G3-F conditions TC offers on opponent wealth and game context, e.g., 0-value bluffs against bankrupt opponents
sophisticated bluff calibration
- G3-F TrueSkill μ=30.1 ± 3.3, 72.9% wins, median score 5,250 on combined-comp1 slice (n=98 canonical games)
Top LLM performance with high win rate and large score.
- Token usage varies roughly 20× across models, from ~14,800 (G3.1-FL) to ~275,000 (G3-F) per game
Reasoning verbosity does not predict strategic strength: both top and weak models span a wide range of token usage.
- G2.5-FL initiates a trade challenge for a goose with zero money cards, offering 0-value bluff
In one trace, G2.5-FL depleted money through overbidding and launched a TC with no resources, failing to condition action on resource state.
- Within-agent score std exceeds cross-seat win-rate differentials by 1–2 orders of magnitude
deck-order variance dominates seat-position variance
- Hardest composition for LLMs: two TrackerAgents (C2, C7), only G3-F still wins majority
card-counting pressure compounds with multiple TrackerAgents
- Claude Haiku 4.5 and GPT-5.4 Nano have TC tightness τ ≈ 0.4, the tightest among all
These two LLMs bargain with minimal overpayment but low overall efficiency.
- TrackerAgent and SetRaceAgent have TC tightness τ ≈ 0.2–0.25, looser counters
Code agents trade bargaining precision for acquisition pressure.
- Gemini 2.5 Flash Lite bid aggressiveness stays flat (~2.07 early, 2.08 late)
G2.5-FL shows no phase adaptation in bidding intensity.
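The first finding above, that paying far above face value for a fourth quartet can still raise the final score, follows from multiplicative scoring. The sketch below assumes the tabletop Kuhhandel rule (sum of completed quartet values multiplied by the number of completed quartets); the summary only states that the formula is multiplicative, so this rule is an assumption about the benchmark, and the quartet values are illustrative rather than the holdings in the cited trace.

```python
from typing import List


def final_score(quartet_values: List[int]) -> int:
    """Assumed rule: sum of completed quartet values times their count."""
    return sum(quartet_values) * len(quartet_values)


def marginal_gain(held: List[int], new_value: int) -> int:
    """Score increase from completing one more quartet."""
    return final_score(held + [new_value]) - final_score(held)


held = [160, 250, 500]          # three completed quartets (illustrative)
gain = marginal_gain(held, 40)  # add a low-value fourth quartet

print(final_score(held))  # 2730
print(gain)               # 1070
```

Under this assumed rule, a fourth quartet with face value v adds S + 4v points, where S is the sum of the quartets already held, so a bid well above v can still be profitable late in the game.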
Claims (25)
- Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations
discussion of potential confounds
- Multi-turn strategic play depends on capabilities (state tracking, adaptive resource allocation, structured-output reliability) that static benchmarks do not measure but conversational evaluations partially capture
explains divergence from static benchmarks
- Overbid frequency, self-bidding rate, bankrupt-initiation patterns, and context-dependent offer calibration are failure modes invisible to both static evaluations and aggregate rankings like Elo
key claim about the benchmark's unique diagnostic value
- Benchmarks of this kind test whether models can sustain strategic coherence over time, manage resource constraints, and adapt interactively — capabilities that static benchmarks do not measure.
Broader methodological claim about the need for multi-agent, long-horizon benchmarks.
- The code-agent ordering (TrackerAgent > SetRaceAgent > EconomyAgent) shows information exploitation matters more than greedy quartet-chasing, which in turn outperforms conservative budgeting
interpretation of what drives success among deterministic strategies
- Behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation that never appear in code agents.
LLMs exhibit systematic errors that deterministic logic avoids.
- Strategic coherence (spending efficiency, resource discipline, phase-adaptive bidding) is associated with rank more strongly than spending volume or any single subskill
core interpretive claim about what separates strong from weak play
- Some failures may reflect prompt design rather than model limitations, but the underlying issue is one of reasoning rather than instruction-following.
Acknowledges the confound of not explicitly instructing models to track wealth, yet points to reasoning gaps given code agents avoid errors without prompts.
- Some failures may reflect prompt design rather than model limitations, though code agents avoid errors without prompts
noted as a possible confound
- Strategic coherence in turn (spending efficiency, resource discipline, adaptive phase play) is associated with success
summary claim linking measured traits to outcomes
Questions (6)
- Can models sustain strategic coherence over time, manage resource constraints, and adapt interactively in multi-agent environments with conflicting incentives?
broader framing question for the benchmark
- Do LLM failures in CATTLE TRADE reflect genuinely hard strategic problems or errors that novice humans also avoid?
Open question about benchmarking against human players to calibrate difficulty.
- Do these failure modes (overbidding, self-bidding, bankrupt initiation) generalise to other economic settings?
Remains untested whether the specific LLM failures observed in CATTLE TRADE extend beyond this game.
- Does a high self-bidding rate reflect a failure to detect non-competitive contexts or a deliberate escalation?
Ambiguity in interpreting the self-bidding metric: from a single trace, cannot distinguish error from aggressive strategy.
- Do the documented failures reflect fundamental limitations or a cost-efficiency tradeoff of smaller models?
question for future work on frontier models
- Do these failure modes generalise to other economic settings?
open question from discussion
Original abstract
We introduce Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50–60 turns. Unlike prior agent benchmarks that test these abilities in isolation, Cattle Trade evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.