Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

DOI: 10.48550/arxiv.2605.14537 · arXiv: 2605.14537 · OpenAlex: W7161243123

adversarial interaction, CATTLE TRADE, behavioural analysis suite, CATTLE TRADE 242-game dataset, bankrupt TC initiation, CATTLE TRADE benchmark, bid aggressiveness, deck order variance, bluff percentage metric, opponent modeling, buy-right percentage metric, overbidding, canonical auction mode, quartet completion

TL;DR

Across 242 games spanning 50–60 turns each, strategic coherence (operationalized as capital efficiency η = score/gross outflow, resource discipline, and phase-adaptive bidding) predicts rank more strongly than any isolated subskill in CATTLE TRADE, a multi-agent benchmark built on a tabletop bluffing-and-auction game. Gemini 3 Flash (G3-F) leads all ten agents with TrueSkill µ = 30.1 ± 3.3, a 72.9% win rate, a capital efficiency of η = 1.77, and an ≈10× bid-aggressiveness ramp from early-game (0.26) to late-game (2.49). Gemini 2.5 Flash Lite, by contrast, posts the highest overall bid aggressiveness (2.52) with essentially no early-to-late adaptation (2.07 → 2.08), yet achieves only η = 0.23 and finishes last. The benchmark introduces a behavioural analysis suite that logs every bid, trade-challenge (TC) offer, counteroffer, and card selection to profile overbid frequency, self-bidding rate, bluff calibration, and TC bargaining tightness (τ), in addition to TrueSkill competitive rating. Two deterministic heuristic code agents, TrackerAgent (µ = 28.7) and SetRaceAgent (µ = 27.3), outperform six and five of the seven tested LLMs respectively, with only G3-F clearly clearing both baselines. TrackerAgent does so through perfect card-counting and opponent-state tracking, a capability no LLM replicates despite receiving identical observable information. The paper argues that cost-efficient LLMs therefore fail not at individual subskills but at deploying them jointly and reliably under competitive pressure, and that benchmarks requiring the integration of auctions, hidden-offer deception, discrete resource constraints, and long-horizon portfolio management are necessary to expose failure modes invisible to static evaluations.
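The capital-efficiency definition above (η = score/gross outflow) and the related cost-per-quartet figure can be sketched as follows. This is a minimal illustration, not the paper's code: the class and field names are hypothetical, the numbers are made up, and it assumes that "total coins spent" for cost per quartet is the same quantity as gross outflow.

```python
from dataclasses import dataclass


@dataclass
class AgentGameTotals:
    """Per-agent totals for one game (hypothetical field names)."""
    final_score: int         # points at game end
    gross_outflow: int       # all coins paid out, ignoring coins received
    quartets_completed: int  # number of completed four-card sets


def capital_efficiency(t: AgentGameTotals) -> float:
    """eta = final score / gross outflow, i.e. points per coin spent."""
    if t.gross_outflow == 0:
        return float("nan")  # undefined when nothing was spent
    return t.final_score / t.gross_outflow


def cost_per_quartet(t: AgentGameTotals) -> float:
    """Coins spent per completed quartet (assumes spend == gross outflow)."""
    if t.quartets_completed == 0:
        return float("inf")  # spent coins but completed nothing
    return t.gross_outflow / t.quartets_completed


# Illustrative numbers only, not taken from the dataset.
disciplined = AgentGameTotals(final_score=3000, gross_outflow=2000, quartets_completed=4)
spendthrift = AgentGameTotals(final_score=500, gross_outflow=2500, quartets_completed=2)

for name, totals in [("disciplined", disciplined), ("spendthrift", spendthrift)]:
    print(name, capital_efficiency(totals), cost_per_quartet(totals))
```

In this toy case the agent that spends more in total ends up with the lower η and the higher cost per quartet, which is the pattern the summary attributes to Gemini 2.5 Flash Lite.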

What to take away

1. Gemini 3 Flash achieves TrueSkill µ = 30.1 ± 3.3 and a 72.9% win rate across 98 canonical games, making it the only LLM to clearly outperform all three deterministic code-agent baselines.
2. TrackerAgent (µ = 28.7) and SetRaceAgent (µ = 27.3) beat six and five of seven LLMs respectively, demonstrating that deterministic card-counting and greedy quartet-pursuit heuristics suffice to outperform most cost-efficient language models in this setting.
3. Capital efficiency η (score per coin of gross outflow) separates the field sharply: G3-F achieves η = 1.77 and G3.1-FL achieves 1.46, while Gemini 2.5 Flash Lite scores η = 0.23 despite the highest bid aggressiveness of 2.52, paying an average of 1,193 coins per completed quartet versus 600–750 for top agents.
4. G3-F escalates bid aggressiveness ≈10× from early-game (0.26) to late-game (2.49) in lockstep with quartet-completion pressure, a phase-adaptation pattern also present in TrackerAgent (0.06 → 1.92) and SetRaceAgent (0.11 → 1.55) but absent in Gemini 2.5 Flash Lite (2.07 early, 2.08 late).
5. Self-bidding rate (raising with no competing bid since the agent's last bid) correlates inversely with performance among LLMs: G3.1-FL self-bids in under 7% of rounds while DeepSeek v3.2, GPT-5.4 Nano, and Gemini 2.5 Flash Lite self-bid in over 74%, with one DeepSeek trace incrementing 10 → 850 over 49 sole-bidder rounds.
6. Gemini 2.5 Flash Lite overbids in 1.20% of auctions (the highest rate among all tested agents, versus 0.00% for all three code agents), a mechanical failure that reveals wealth to all opponents, restarts the auction at zero, and directly compounds downstream cash depletion.
7. Token verbosity does not predict strategic quality: G3-F emits ≈1,500 completion tokens per call (≈275,000 per game) and ranks first, while Claude Haiku 4.5 emits the most completion tokens of any model yet ranks ninth, and G3.1-FL uses only ≈80 tokens per call (≈14,800 per game) while ranking third.
8. The benchmark introduces a behavioural analysis suite logging every bid, TC offer, counteroffer, and card selection, computing overbid frequency, self-bid rate, TC bargaining tightness τ, capital efficiency η, buy-right usage, bluff rate, and phase-dependent bid aggressiveness to diagnose why agents lose rather than just that they lose.
9. An open question the paper raises is whether the observed failure modes — overbidding, self-bidding spirals, bankrupt TC initiation, and weak opponent-state adaptation — reflect fundamental reasoning limitations of cost-efficient models or a cost-efficiency tradeoff that would disappear if frontier models were evaluated at full reasoning budgets.
10. To replicate the mixed-format calibration design, a researcher would run 168 games with a single LLM facing three deterministic code agents across seven fixed opponent-composition schedules (e.g., TrackerAgent + EconomyAgent + SetRaceAgent for C1, two TrackerAgents + EconomyAgent for C2), using temperature 0.1, reasoning effort low, a 4,096-token response limit, and TrueSkill for competitive rating.
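The phase-adaptation pattern in takeaway 4 can be reproduced from a bid log roughly as below. This is a sketch under stated assumptions: bid aggressiveness is taken as the mean of bid divided by the quartet value of the auctioned animal (as defined in the Methods glossary), the early/late turn thresholds are hypothetical because the summary does not give the paper's phase boundaries, and the bids are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical phase boundaries; the summary does not say where the paper
# draws the early/late line within a 50-60 turn game.
EARLY_END = 20
LATE_START = 40


def phase_of(turn: int) -> str:
    """Map a turn number to a coarse game phase."""
    if turn <= EARLY_END:
        return "early"
    if turn >= LATE_START:
        return "late"
    return "mid"


def aggressiveness_by_phase(bids):
    """bids: iterable of (turn, bid, quartet_value) tuples for one agent.

    Returns the mean bid / quartet_value ratio per phase.
    """
    ratios = defaultdict(list)
    for turn, bid, quartet_value in bids:
        ratios[phase_of(turn)].append(bid / quartet_value)
    return {phase: mean(values) for phase, values in ratios.items()}


# Illustrative bids only, not taken from the dataset.
example_bids = [
    (3, 10, 40), (8, 40, 160), (12, 125, 500),       # early: bids at 0.25x value
    (45, 400, 160), (50, 1250, 500), (55, 100, 40),  # late: bids at 2.5x value
]
by_phase = aggressiveness_by_phase(example_bids)
ramp = by_phase["late"] / by_phase["early"]
print(by_phase, ramp)
```

An agent with a flat profile, like the 2.07 → 2.08 reported for Gemini 2.5 Flash Lite, would show a ramp close to 1 under the same calculation.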

Peer brief — for seminar discussion

CATTLE TRADE is a multi-agent benchmark built on a four-player tabletop card game (Kuhhandel, Ravensburger 1985) adapted into a Python game engine with a structured LLM agent framework. Seven cost-efficient LLMs and three deterministic code agents compete in 242 games of 50–60 turns each; the LLMs are Gemini 3 Flash (G3-F), Gemini 3.1 Flash Lite (G3.1-FL), Claude Sonnet 4.5, Claude Haiku 4.5, DeepSeek v3.2, GPT-5.4 Nano, and Gemini 2.5 Flash Lite. Agents are evaluated by TrueSkill competitive rating and by a behavioural analysis suite, introduced in the paper, that tracks overbid frequency, self-bidding rate, TC bargaining tightness τ, capital efficiency η, and phase-adaptive bid aggressiveness. An alternative design could have used Elo ratings on pairwise outcomes instead of TrueSkill's Bayesian multi-player skill estimates, but that would have discarded information about within-game finish position and made the unbalanced game counts across agent pairs harder to handle.

The load-bearing finding is that strategic coherence (specifically capital efficiency η = score/gross outflow, resource discipline, and bid-timing adaptation) predicts rank more reliably than any isolated subskill across both pure-LLM and mixed-format games. G3-F achieves η = 1.77 with a 72.9% win rate (TrueSkill µ = 30.1 ± 3.3), while TrackerAgent (µ = 28.7, 53.6% wins) and SetRaceAgent (µ = 27.3) outperform six and five of the seven LLMs respectively. Gemini 2.5 Flash Lite, the weakest agent, posts η = 0.23 despite the highest bid aggressiveness (2.52), pays on average 1,193 coins per completed quartet versus 600–750 for top agents, and overbids in 1.20% of auctions, a figure that is 0.00% for every code agent. G3-F scales bid intensity ≈10× from early- to late-game (0.26 → 2.49), a rational ramp also present in TrackerAgent (0.06 → 1.92); Gemini 2.5 Flash Lite shows no adaptation (2.07 → 2.08).

Behavioural traces surface a specific failure mode. DeepSeek v3.2 self-bids in 75.4% of rounds, and in one trace it increments its own bid from 10 to 850 coins over 49 uncontested rounds, which suggests the model treats its own previous bid as a competing signal. The paper argues that the failure modes documented here (overbidding, self-bidding spirals, bankrupt TC initiation, and inadequate opponent-state adaptation) reflect not the absence of individual skills but their unreliable integration under competitive pressure, and it leaves open whether frontier models at full reasoning budgets would close the gap with the heuristic baselines.

The most pointed critique a careful reader would raise is that all seven LLMs are cost-efficient models running at low reasoning effort with a 4,096-token budget, so the finding that two simple heuristics outperform most of them may say more about model tier and inference budget than about LLM strategic reasoning in general. The paper acknowledges this but does not test even one frontier model (e.g., full Claude Sonnet or GPT-4o) at higher reasoning effort, leaving open whether the hierarchy collapses when capability constraints are relaxed. A secondary concern is that Sonnet 4.5 appears in only 14 games (versus 47–50 for the primary six), producing a TrueSkill σ ≈ 1.6 versus ≈ 1.0 for the others, which makes any comparative claim about its mid-field placement tentative. Finally, the benchmark's single-prompt design (every model receives identical neutral game rules plus "play optimally to maximize your expected score") means prompt sensitivity is uncharacterized. Some documented failures, such as overbidding, may partly reflect numerical parsing or working-memory load from the natural-language observation format rather than strategic reasoning per se, a confound the code agents avoid by operating on structured data with exact arithmetic.
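The self-bidding behaviour described above can be detected from an auction's bid history with a few lines of code. The sketch below is an assumption-laden illustration: it treats an auction as a flat, time-ordered list of (bidder, amount) pairs, which is a hypothetical log shape, and counts a bid as a self-bid when the immediately preceding bid came from the same agent, meaning nobody competed since that agent last bid.

```python
def self_bid_rate(bid_sequence, agent):
    """Fraction of `agent`'s bids placed with no competing bid since its last bid.

    bid_sequence: time-ordered list of (bidder, amount) pairs for one auction
    (hypothetical log shape, assumed for illustration).
    """
    own_bids = 0
    self_bids = 0
    last_bidder = None
    for bidder, _amount in bid_sequence:
        if bidder == agent:
            own_bids += 1
            # The previous bid was also ours, so we are raising against ourselves.
            if last_bidder == agent:
                self_bids += 1
        last_bidder = bidder
    return self_bids / own_bids if own_bids else 0.0


# Illustrative auction only: A opens, B answers once, then A keeps raising alone.
auction = [("A", 10), ("B", 20), ("A", 30), ("A", 40), ("A", 50)]
print(self_bid_rate(auction, "A"))  # two of A's four bids are uncontested raises
print(self_bid_rate(auction, "B"))
```

A trace like the DeepSeek v3.2 one, with dozens of consecutive sole-bidder raises, would push this ratio toward 1 for that auction.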

Methods (17)

behavioural analysis suite
suite profiling strategic play via spending efficiency, bluff rates, phase-dependent bid adaptation, self-bidding rates, and buy-right patterns
bid aggressiveness
mean of bid divided by quartet value of auctioned animal
bluff percentage metric
Fraction of an agent's TC offers consisting entirely of 0-value money cards.
buy-right percentage metric
Fraction of auctioneer decisions where the agent exercised buy-right.
canonical auction mode
auction mode with iterative call rounds where all non-auctioneer players submit bids simultaneously, faithful to tabletop rules
capital efficiency η
ratio of final score to gross outflow, measuring points per coin spent
cost per quartet metric
Total coins spent by an agent divided by quartets completed, measuring acquisition efficiency.
EconomyAgent
deterministic code agent that models resource economy, tracking money flows and exploiting cash-poor opponents
fast auction mode
auction mode with a single sealed bid per player
legacy auction mode
auction mode with sequential bidding
overbid frequency metric
Fraction of auctions where the agent bids more than its total money, triggering wealth revelation.
overbid rate
fraction of auctions in which an agent submitted a bid exceeding its total money, triggering wealth revelation penalty
self-bid rate
fraction of auction bids placed in rounds with no competing bid since the agent's last bid
SetRaceAgent
deterministic code agent that greedily pursues quartet completion, bidding aggressively on near-complete sets
TC-accept rate metric
Fraction of trade challenges resolved by accepting the face-down offer rather than countering.
TC bargaining tightness τ
size-weighted ratio of (loser offer + 10) to winner offer in counter-exchange wins, where 1 means winner paid only the minimum increment
TrackerAgent
deterministic code agent that maintains perfect information from observable events and makes greedy decisions conditioned on card counts and estimated wealth
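
The TC bargaining tightness definition above leaves the weighting scheme implicit, so the following is only one plausible reading. It assumes "size-weighted" means each counter-exchange win is weighted by the winner's offer, and it treats the "+10" as the minimum increment needed to beat the loser's offer; the function name and input shape are hypothetical.

```python
MIN_INCREMENT = 10  # the "+10" in the tightness definition


def tc_tightness(exchanges):
    """Size-weighted mean of (loser_offer + 10) / winner_offer.

    exchanges: list of (winner_offer, loser_offer) pairs, one per
    counter-exchange win. Weighting by winner_offer is an assumption.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for winner_offer, loser_offer in exchanges:
        if winner_offer <= 0:
            continue  # skip degenerate records rather than divide by zero
        ratio = (loser_offer + MIN_INCREMENT) / winner_offer
        weighted_sum += winner_offer * ratio
        total_weight += winner_offer
    return weighted_sum / total_weight if total_weight else float("nan")


# Illustrative exchanges only, not taken from the dataset.
minimal_win = [(60, 50)]             # winner paid exactly loser + 10
mixed_wins = [(100, 90), (400, 90)]  # one tight win, one large overpayment
print(tc_tightness(minimal_win), tc_tightness(mixed_wins))
```

Under this reading a value of 1 means the winner paid only the minimum increment, and larger overpayments pull τ toward 0, matching the glossary's description of tighter versus looser counters.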

Frameworks (2)

CATTLE TRADE
multi-agent benchmark for LLM bluffing, bidding, and bargaining, integrating auctions, hidden-offer trade challenges, and resource management
CATTLE TRADE benchmark
A multi-agent benchmark integrating auctions, hidden-offer trade challenges, bluffing, bargaining, and resource management over 50-60 turns with four players, evaluating LLMs and code agents.

Datasets (1)

CATTLE TRADE 242-game dataset
Full dataset of 242 games (228 primary + 14 exploratory) logging every bid, TC offer, counteroffer, and card selection across 7 LLMs and 3 code agents.
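
Because the dataset logs every bid, metrics such as overbid rate can be recomputed directly from the event stream. The sketch below assumes a hypothetical record shape (the real log schema is not described in this summary) and takes the denominator to be the auctions in which the agent placed at least one bid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BidEvent:
    """One logged bid (hypothetical schema, for illustration only)."""
    auction_id: int
    agent: str
    amount: int
    money_held: int  # bidder's total money at the time of the bid


def overbid_rate(events, agent):
    """Fraction of the agent's auctions containing a bid above its total money."""
    auctions = set()
    overbid_auctions = set()
    for event in events:
        if event.agent != agent:
            continue
        auctions.add(event.auction_id)
        if event.amount > event.money_held:
            overbid_auctions.add(event.auction_id)
    return len(overbid_auctions) / len(auctions) if auctions else 0.0


# Illustrative events only, not taken from the dataset.
events = [
    BidEvent(auction_id=1, agent="A", amount=50, money_held=100),
    BidEvent(auction_id=1, agent="A", amount=120, money_held=100),  # overbid
    BidEvent(auction_id=2, agent="A", amount=30, money_held=100),
    BidEvent(auction_id=2, agent="B", amount=40, money_held=200),
]
print(overbid_rate(events, "A"), overbid_rate(events, "B"))
```

Counting per auction rather than per bid follows the glossary wording ("fraction of auctions"), so two overbids in the same auction would still count once.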

Findings (50)

Gemini 3 Flash completes fourth quartet by paying far above face value, netting ≈1,800 points from multiplicative scoring
A trace shows G3-F turning a nominally wasteful overpay into a net score gain due to the multiplicative formula.
G3-F conditions TC offers on opponent wealth and game context, e.g., 0-value bluffs against bankrupt opponents
sophisticated bluff calibration
G3-F TrueSkill μ=30.1 ± 3.3, 72.9% wins, median score 5,250 on combined-comp1 slice (n=98 canonical games)
Top LLM performance with high win rate and large score.
Token usage varies roughly 20× across models, from ~14,800 (G3.1-FL) to ~275,000 (G3-F) per game
Reasoning verbosity does not predict strategic strength: both top and weak models span a wide range of token usage.
G2.5-FL initiates a trade challenge for a goose with zero money cards, offering 0-value bluff
In one trace, G2.5-FL depleted money through overbidding and launched a TC with no resources, failing to condition action on resource state.
Within-agent score std exceeds cross-seat win-rate differentials by 1–2 orders of magnitude
deck-order variance dominates seat-position variance
Hardest composition for LLMs: two TrackerAgents (C2, C7), only G3-F still wins majority
card-counting pressure compounds with multiple TrackerAgents
Claude Haiku 4.5 and GPT-5.4 Nano have TC tightness τ ≈ 0.4, the tightest among all
These two LLMs bargain with minimal overpayment but low overall efficiency.
TrackerAgent and SetRaceAgent have TC tightness τ ≈ 0.2–0.25, looser counters
Code agents trade bargaining precision for acquisition pressure.
Gemini 2.5 Flash Lite bid aggressiveness stays flat (~2.07 early, 2.08 late)
G2.5-FL shows no phase adaptation in bidding intensity.

Claims (25)

Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations
discussion of potential confounds
Multi-turn strategic play depends on capabilities (state tracking, adaptive resource allocation, structured-output reliability) that static benchmarks do not measure but conversational evaluations partially capture
explains divergence from static benchmarks
Overbid frequency, self-bidding rate, bankrupt-initiation patterns, and context-dependent offer calibration are failure modes invisible to both static evaluations and aggregate rankings like Elo
key claim about the benchmark's unique diagnostic value
Benchmarks of this kind test whether models can sustain strategic coherence over time, manage resource constraints, and adapt interactively — capabilities that static benchmarks do not measure.
Broader methodological claim about the need for multi-agent, long-horizon benchmarks.
The code-agent ordering (TrackerAgent > SetRaceAgent > EconomyAgent) shows information exploitation matters more than greedy quartet-chasing, which in turn outperforms conservative budgeting
interpretation of what drives success among deterministic strategies
Behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation that never appear in code agents.
LLMs exhibit systematic errors that deterministic logic avoids.
Strategic coherence (spending efficiency, resource discipline, phase-adaptive bidding) is associated with rank more strongly than spending volume or any single subskill
core interpretive claim about what separates strong from weak play
Some failures may reflect prompt design rather than model limitations, but the underlying issue is one of reasoning rather than instruction-following.
Acknowledges the confound of not explicitly instructing models to track wealth, yet points to reasoning gaps given code agents avoid errors without prompts.
Some failures may reflect prompt design rather than model limitations, though code agents avoid errors without prompts
noted as a possible confound
Strategic coherence (spending efficiency, resource discipline, adaptive phase play) is in turn associated with success
summary claim linking measured traits to outcomes

Questions (6)

Can models sustain strategic coherence over time, manage resource constraints, and adapt interactively in multi-agent environments with conflicting incentives?
broader framing question for the benchmark
Do LLM failures in CATTLE TRADE reflect genuinely hard strategic problems or errors that novice humans also avoid?
Open question about benchmarking against human players to calibrate difficulty.
Do these failure modes (overbidding, self-bidding, bankrupt initiation) generalise to other economic settings?
Remains untested whether the specific LLM failures observed in CATTLE TRADE extend beyond this game.
Does a high self-bidding rate reflect a failure to detect non-competitive contexts or a deliberate escalation?
Ambiguity in interpreting the self-bidding metric: from a single trace, cannot distinguish error from aggressive strategy.
Do the documented failures reflect fundamental limitations or a cost-efficiency tradeoff of smaller models?
question for future work on frontier models
Do these failure modes generalise to other economic settings?
open question from discussion

Original abstract

We introduce CATTLE TRADE, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50–60 turns. Unlike prior agent benchmarks that test these abilities in isolation, CATTLE TRADE evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.
