2026
paper:doi-10-48550-arxiv-2605-14537

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

TL;DR

Across 242 games of 50–60 turns each, strategic coherence (operationalized as capital efficiency η = score/gross outflow, resource discipline, and phase-adaptive bidding) predicts rank more strongly than any isolated subskill in CATTLE TRADE, a multi-agent benchmark built on a tabletop bluffing-and-auction game. Gemini 3 Flash (G3-F) leads all ten agents with TrueSkill µ = 30.1 ± 3.3, a 72.9% win rate, a capital efficiency of η = 1.77, and an ≈10× bid-aggressiveness ramp from early-game (0.26) to late-game (2.49); Gemini 2.5 Flash Lite, by contrast, posts the highest bid aggressiveness of any agent (2.52) with almost no change across game phases, yet achieves η = 0.23 and finishes last. Beyond TrueSkill competitive rating, the benchmark introduces a behavioural analysis suite that logs every bid, trade-challenge (TC) offer, counteroffer, and card selection to profile overbid frequency, self-bidding rate, bluff calibration, and TC bargaining tightness (τ). Two deterministic heuristic code agents, TrackerAgent (µ = 28.7) and SetRaceAgent (µ = 27.3), outperform six and five of the seven tested LLMs respectively, with only G3-F clearly clearing both baselines; TrackerAgent does so through perfect card-counting and opponent-state tracking, a capability no LLM replicates despite receiving identical observable information. The paper argues that cost-efficient LLMs fail not at individual subskills but at their reliable joint deployment under competitive pressure, and that benchmarks requiring the integration of auctions, hidden-offer deception, discrete resource constraints, and long-horizon portfolio management are necessary to expose failure modes invisible to static evaluations.

What to take away

  1. Gemini 3 Flash achieves TrueSkill µ = 30.1 ± 3.3 and a 72.9% win rate across 98 canonical games, making it the only LLM to clearly outperform all three deterministic code-agent baselines.
  2. TrackerAgent (µ = 28.7) and SetRaceAgent (µ = 27.3) beat six and five of seven LLMs respectively, demonstrating that deterministic card-counting and greedy quartet-pursuit heuristics suffice to outperform most cost-efficient language models in this setting.
  3. Capital efficiency η (score per coin of gross outflow) separates the field sharply: G3-F achieves η = 1.77 and G3.1-FL achieves 1.46, while Gemini 2.5 Flash Lite scores η = 0.23 despite the highest bid aggressiveness of 2.52, paying an average of 1,193 coins per completed quartet versus 600–750 for top agents.
  4. G3-F escalates bid aggressiveness ≈10× from early-game (0.26) to late-game (2.49) in lockstep with quartet-completion pressure, a phase-adaptation pattern also present in TrackerAgent (0.06 → 1.92) and SetRaceAgent (0.11 → 1.55) but absent in Gemini 2.5 Flash Lite (2.07 early, 2.08 late).
  5. Self-bidding rate (raising with no competing bid since the agent's last bid) correlates inversely with performance among LLMs: G3.1-FL self-bids in under 7% of rounds while DeepSeek v3.2, GPT-5.4 Nano, and Gemini 2.5 Flash Lite self-bid in over 74%, with one DeepSeek trace incrementing 10 → 850 over 49 sole-bidder rounds.
  6. Gemini 2.5 Flash Lite overbids in 1.20% of auctions (the highest rate among all tested agents, versus 0.00% for all three code agents), a mechanical failure that reveals wealth to all opponents, restarts the auction at zero, and directly compounds downstream cash depletion.
  7. Token verbosity does not predict strategic quality: G3-F emits ≈1,500 completion tokens per call (≈275,000 per game) and ranks first, while Claude Haiku 4.5 emits the most completion tokens of any model yet ranks ninth, and G3.1-FL uses only ≈80 tokens per call (≈14,800 per game) while ranking third.
  8. The benchmark introduces a behavioural analysis suite logging every bid, TC offer, counteroffer, and card selection, computing overbid frequency, self-bid rate, TC bargaining tightness τ, capital efficiency η, buy-right usage, bluff rate, and phase-dependent bid aggressiveness to diagnose why agents lose rather than just that they lose.
  9. An open question the paper raises is whether the observed failure modes (overbidding, self-bidding spirals, bankrupt TC initiation, and weak opponent-state adaptation) reflect fundamental reasoning limitations of cost-efficient models or a cost-efficiency tradeoff that would disappear if frontier models were evaluated at full reasoning budgets.
  10. To replicate the mixed-format calibration design, a researcher would run 168 games with a single LLM facing three deterministic code agents across seven fixed opponent-composition schedules (e.g., TrackerAgent + EconomyAgent + SetRaceAgent for C1, two TrackerAgents + EconomyAgent for C2), using temperature 0.1, reasoning effort low, a 4,096-token response limit, and TrueSkill for competitive rating.
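
The phase-adaptation figures in item 4 can be checked with a few lines of arithmetic. The sketch below uses Python (the language the benchmark engine is described as using) and divides late-game by early-game bid aggressiveness for the four agents quoted in that item. The dictionary name and the `ramp` helper are illustrative, not taken from the paper.

```python
# Early- and late-game bid aggressiveness as quoted in the takeaways above.
# The container name and layout are assumptions made for this illustration.
PHASE_AGGRESSIVENESS = {
    "G3-F": (0.26, 2.49),
    "TrackerAgent": (0.06, 1.92),
    "SetRaceAgent": (0.11, 1.55),
    "Gemini 2.5 Flash Lite": (2.07, 2.08),
}


def ramp(early: float, late: float) -> float:
    """Late-game aggressiveness as a multiple of early-game aggressiveness."""
    return late / early


ramps = {agent: ramp(*phases) for agent, phases in PHASE_AGGRESSIVENESS.items()}

for agent, factor in ramps.items():
    print(f"{agent}: {factor:.1f}x")
```

G3-F's factor of about 9.6 is consistent with the ≈10× figure, while Gemini 2.5 Flash Lite's factor stays at about 1.0, which is what "no adaptation" means numerically. The code agents show even larger multiples because their early-game bids start close to zero.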

Peer brief — for seminar discussion

CATTLE TRADE is a multi-agent benchmark built on a four-player tabletop card game (Kuhhandel, Ravensburger 1985) adapted into a Python game engine with a structured LLM agent framework. Seven cost-efficient LLMs — Gemini 3 Flash (G3-F), Gemini 3.1 Flash Lite (G3.1-FL), Claude Sonnet 4.5, Claude Haiku 4.5, DeepSeek v3.2, GPT-5.4 Nano, and Gemini 2.5 Flash Lite — and three deterministic code agents compete in 242 games of 50–60 turns each, evaluated by TrueSkill competitive rating and a behavioural analysis suite introduced here that tracks overbid frequency, self-bidding rate, TC bargaining tightness τ, capital efficiency η, and phase-adaptive bid aggressiveness. An alternative evaluation design could have used Elo ratings on pairwise outcomes rather than TrueSkill's Bayesian multi-player skill estimates, which would have lost information about within-game finish position and made it harder to handle the unbalanced game counts across agent pairs. The load-bearing finding is that strategic coherence — specifically capital efficiency η = score/gross outflow, resource discipline, and bid-timing adaptation — predicts rank more reliably than any isolated subskill across both pure-LLM and mixed-format games. G3-F achieves η = 1.77 with a 72.9% win rate (TrueSkill µ = 30.1 ± 3.3), while TrackerAgent (µ = 28.7, 53.6% wins) and SetRaceAgent (µ = 27.3) outperform six and five of seven LLMs respectively. Gemini 2.5 Flash Lite, the weakest agent, posts η = 0.23 despite the highest bid aggressiveness (2.52), pays on average 1,193 coins per completed quartet versus 600–750 for top agents, and overbids in 1.20% of auctions — a figure that is 0.00% for every code agent. G3-F scales bid intensity ≈10× from early- to late-game (0.26 → 2.49), a rational ramp also present in TrackerAgent (0.06 → 1.92); Gemini 2.5 Flash Lite shows no adaptation (2.07 → 2.08). 
Behavioural traces surface a specific failure mode: DeepSeek v3.2 self-bids at a 75.4% rate, and one of its traces increments from 10 to 850 coins over 49 uncontested rounds, suggesting the model treats its own previous bid as a competing signal. The paper predicts that the failure modes documented here (overbidding, self-bidding spirals, bankrupt TC initiation, and inadequate opponent-state adaptation) reflect not the absence of individual skills but their unreliable integration under competitive pressure, and that frontier models at full reasoning budgets might close the gap with heuristic baselines. The most pointed critique a careful reader would raise is that all seven LLMs are cost-efficient models running at low reasoning effort with a 4,096-token budget, so the finding that two simple heuristics outperform most of them may say more about model tier and inference budget than about LLM strategic reasoning in general. The paper acknowledges this but does not test even one frontier model (e.g., full Claude Sonnet or GPT-4o) at higher reasoning effort, leaving open whether the hierarchy collapses when capability constraints are relaxed. A secondary concern is that Sonnet 4.5 appears in only 14 games (versus 47–50 for the primary six), producing a TrueSkill σ ≈ 1.6 versus ≈ 1.0 for the others, which makes any comparative claim about its mid-field placement tentative. Finally, the benchmark's single-prompt design (every model receives identical neutral game rules plus "play optimally to maximize your expected score") means prompt sensitivity is uncharacterized, and some documented failures (e.g., overbidding) may partly reflect numerical parsing or working-memory load from the natural-language observation format rather than strategic reasoning per se, a confound the code agents avoid by operating on structured data with exact arithmetic.

Methods (17)

  • behavioural analysis suite
    suite profiling strategic play via spending efficiency, bluff rates, phase-dependent bid adaptation, self-bidding rates, and buy-right patterns
  • bid aggressiveness
    mean of bid divided by quartet value of auctioned animal
  • bluff percentage metric
    Fraction of an agent's TC offers consisting entirely of 0-value money cards.
  • buy-right percentage metric
    Fraction of auctioneer decisions where the agent exercised buy-right.
  • canonical auction mode
    auction mode with iterative call rounds where all non-auctioneer players submit bids simultaneously, faithful to tabletop rules
  • capital efficiency η
    ratio of final score to gross outflow, measuring points per coin spent
  • cost per quartet metric
    Total coins spent by an agent divided by quartets completed, measuring acquisition efficiency.
  • EconomyAgent
    deterministic code agent that models resource economy, tracking money flows and exploiting cash-poor opponents
  • fast auction mode
    auction mode with a single sealed bid per player
  • legacy auction mode
    auction mode with sequential bidding
  • overbid frequency metric
    Fraction of auctions where the agent bids more than its total money, triggering wealth revelation.
  • overbid rate
    fraction of auctions in which an agent submitted a bid exceeding its total money, triggering wealth revelation penalty
  • self-bid rate
    fraction of auction bids placed in rounds with no competing bid since the agent's last bid
  • SetRaceAgent
    deterministic code agent that greedily pursues quartet completion, bidding aggressively on near-complete sets
  • TC-accept rate metric
    Fraction of trade challenges resolved by accepting the face-down offer rather than countering.
  • TC bargaining tightness τ
    size-weighted ratio of (loser offer + 10) to winner offer in counter-exchange wins, where 1 means winner paid only the minimum increment
  • TrackerAgent
    deterministic code agent that maintains perfect information from observable events and makes greedy decisions conditioned on card counts and estimated wealth
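
The auction-side metrics defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Bid` record and its field names are hypothetical, bids are assumed to arrive as one time-ordered list, and the overbid rate is taken over auctions the agent actually bid in (the summary does not specify the denominator).

```python
from dataclasses import dataclass


@dataclass
class Bid:
    """Hypothetical log record; field names are assumptions, not the paper's schema."""
    auction_id: int
    bidder: str
    amount: int
    bidder_money: int   # bidder's total money when the bid was placed
    quartet_value: int  # quartet value of the auctioned animal


def bid_aggressiveness(bids, agent):
    """Mean of bid / quartet value over the agent's bids."""
    ratios = [b.amount / b.quartet_value for b in bids if b.bidder == agent]
    return sum(ratios) / len(ratios) if ratios else 0.0


def overbid_rate(bids, agent):
    """Share of the agent's auctions containing a bid above its total money."""
    entered = {b.auction_id for b in bids if b.bidder == agent}
    overbid = {b.auction_id for b in bids
               if b.bidder == agent and b.amount > b.bidder_money}
    return len(overbid) / len(entered) if entered else 0.0


def self_bid_rate(bids, agent):
    """Share of the agent's bids placed with no competing bid since its last bid."""
    own = self_bids = 0
    last_bidder = {}  # auction_id -> bidder of the most recent bid in that auction
    for b in bids:
        if b.bidder == agent:
            own += 1
            if last_bidder.get(b.auction_id) == agent:
                self_bids += 1
        last_bidder[b.auction_id] = b.bidder
    return self_bids / own if own else 0.0


def capital_efficiency(score, gross_outflow):
    """eta = final score / gross outflow (points per coin spent)."""
    return score / gross_outflow if gross_outflow else 0.0


def cost_per_quartet(total_spent, quartets):
    """Total coins spent divided by quartets completed."""
    return total_spent / quartets if quartets else float("inf")


# Made-up example log: agent A raises itself once and overbids in auction 2.
example_bids = [
    Bid(1, "A", 10, 100, 100),
    Bid(1, "A", 20, 100, 100),   # self-bid: nobody bid since A's last bid
    Bid(1, "B", 30, 50, 100),
    Bid(1, "A", 40, 100, 100),
    Bid(2, "A", 150, 100, 50),   # overbid: 150 exceeds A's 100 coins
]
```

On this made-up log, agent A has a self-bid rate of 1/4 and an overbid rate of 1/2. Keeping each metric as a small pure function over the log makes it easy to swap in a different denominator if the paper's definition turns out to differ.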

Frameworks (2)

  • CATTLE TRADE
    multi-agent benchmark for LLM bluffing, bidding, and bargaining, integrating auctions, hidden-offer trade challenges, and resource management
  • CATTLE TRADE benchmark
    A multi-agent benchmark integrating auctions, hidden-offer trade challenges, bluffing, bargaining, and resource management over 50-60 turns with four players, evaluating LLMs and code agents.

Datasets (1)

  • CATTLE TRADE 242-game dataset
    Full dataset of 242 games (228 primary + 14 exploratory) logging every bid, TC offer, counteroffer, and card selection across 7 LLMs and 3 code agents.
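
Because the dataset logs every TC offer and counteroffer, the trade-challenge metrics listed under Methods could in principle be recomputed from it. The Python sketch below shows one way to do so. The `TradeChallenge` record is a hypothetical schema, and the tightness formula reads "size-weighted ratio" as the sum of (loser offer + 10) divided by the sum of winner offers, which is an interpretation rather than a confirmed definition. Tied exchanges are skipped.

```python
from dataclasses import dataclass
from typing import Optional

MIN_INCREMENT = 10  # the "+10" in the tightness definition


@dataclass
class TradeChallenge:
    """Hypothetical log record; field names are assumptions for illustration."""
    initiator: str
    target: str
    offer: list[int]                    # money-card values in the face-down offer
    counter: Optional[list[int]] = None  # None when the target accepted the offer


def bluff_pct(tcs, agent):
    """Share of the agent's TC offers made up entirely of 0-value money cards."""
    offers = [tc.offer for tc in tcs if tc.initiator == agent]
    if not offers:
        return 0.0
    bluffs = sum(1 for o in offers if o and all(card == 0 for card in o))
    return bluffs / len(offers)


def tc_accept_rate(tcs, agent):
    """Share of TCs aimed at the agent that it accepted rather than countered."""
    received = [tc for tc in tcs if tc.target == agent]
    if not received:
        return 0.0
    return sum(1 for tc in received if tc.counter is None) / len(received)


def tc_tightness(tcs, agent):
    """Assumed size-weighted tau over counter-exchanges the agent won."""
    numerator = denominator = 0
    for tc in tcs:
        if tc.counter is None or agent not in (tc.initiator, tc.target):
            continue
        mine, theirs = sum(tc.offer), sum(tc.counter)
        if tc.target == agent:
            mine, theirs = theirs, mine
        if mine > theirs:  # the agent won this exchange
            numerator += theirs + MIN_INCREMENT
            denominator += mine
    return numerator / denominator if denominator else None


# Made-up example: one accepted bluff and two counter-exchanges, both won by A.
example_tcs = [
    TradeChallenge("A", "B", [0, 0]),
    TradeChallenge("A", "B", [50, 10], [40]),
    TradeChallenge("B", "A", [100], [200]),
]
```

Under this reading, a winner who always pays exactly the loser's offer plus 10 gets τ = 1, matching the definition's statement that 1 means the winner paid only the minimum increment. Lower values indicate the winner overpaid.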

Original abstract

We introduce \textsc{Cattle Trade}, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50--60 turns. Unlike prior agent benchmarks that test these abilities in isolation, \textsc{Cattle Trade} evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.