CATTLE TRADE benchmark

A multi-agent benchmark integrating auctions, hidden-offer trade challenges, bluffing, bargaining, and resource management over 50–60 turns with four players, evaluating LLMs and code agents.

Neighborhood — ranked by edge-count

Methods (10)

method

canonical auction mode
implements
Auction mode with iterative call rounds in which all non-auctioneer players submit bids simultaneously, faithful to the tabletop rules.
scratchpad mechanism
implements
Free-text memory buffer updated each turn via an additional model call, included in subsequent observations under 'YOUR NOTES'.
Scratchpad memory mechanism
implements
Agent personal buffer updated after own turn via an extra model call, fed back into observations.
fast auction mode
implements
Auction mode with a single sealed bid per player.
legacy auction mode
implements
Auction mode with sequential bidding.
TrueSkill
implements
Bayesian skill rating system used for competitive ranking in CATTLE TRADE.
TrueSkill rating system
implements
Bayesian skill rating system used to rank agents from game outcomes.
dynamic-programming subset-sum payment resolution
implements
Algorithm that finds the minimum-overpay combination of discrete money cards to meet a payment amount with no change given.
full memory mode
implements
Agent configuration in which the scratchpad is maintained and recent game events are provided in observations.
Structured JSON action interface
implements
Agents respond with JSON specifying exact card selections and amounts; a multi-stage fallback handles errors.
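
The dynamic-programming subset-sum payment resolution listed above can be illustrated with a minimal sketch. It assumes a hand is a list of integer card values and that, among equal totals, fewer cards are preferred; the function name `min_overpay_payment`, the tie-break, and the example denominations are hypothetical and are not taken from the benchmark's code.

```python
from typing import Optional


def min_overpay_payment(hand: list[int], amount: int) -> Optional[list[int]]:
    """Pick cards from `hand` whose total is the smallest sum >= `amount`.

    Returns None when the whole hand cannot cover the amount.
    Hypothetical helper: illustrates the idea, not the benchmark's implementation.
    """
    if amount <= 0:
        return []
    # reachable maps every achievable total to one card combination that produces it.
    reachable: dict[int, list[int]] = {0: []}
    for card in hand:
        # Iterate over a snapshot so each physical card is used at most once.
        for total, combo in list(reachable.items()):
            new_total = total + card
            known = reachable.get(new_total)
            # Assumed tie-break: keep the combination that uses fewer cards.
            if known is None or len(combo) + 1 < len(known):
                reachable[new_total] = combo + [card]
    payable = [total for total in reachable if total >= amount]
    if not payable:
        return None
    # No change is given, so the best payment is the smallest total that covers the amount.
    return reachable[min(payable)]


# Illustrative hand: the payer owes 30 but holds no exact combination.
print(min_overpay_payment([0, 10, 10, 50, 100], 30))  # [50], an overpayment of 20
print(min_overpay_payment([0, 10, 10, 50, 100], 20))  # [10, 10], exact payment
```

The table of reachable totals grows with the sum of the hand rather than with the number of subsets, which keeps the search cheap for the small hands that occur in a card game.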

Concepts (7)

concept

auction
about
Competitive bidding mechanism in the game where players vie for animal cards.
discrete money with no change
about
Payments use fixed denominations; no change given, forcing overpayment and resource constraint management.
bluffing
about
Deceptive strategy that places 0-value money cards in face-down offers to induce the opponent to accept without revealing the offer's true value.
Kuhhandel (You're Bluffing!) tabletop game
associated_with
Original 3–5 player card game by Rüdiger Koltze (Ravensburger, 1985) involving auctions, hidden offers, and bluffing, which CATTLE TRADE adapts.
hidden-offer trade challenge (TC)
about
Bilateral bargaining with face-down money offers, enabling bluffing via 0-value cards and information asymmetry.
imperfect information
about
Game condition in which players know only how many money cards each opponent holds, not their exact values.
resource management
about
Allocating discrete money cards and animal holdings over many turns to maximize final score.
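
The information asymmetry behind bluffing and hidden-offer trade challenges can be sketched as follows. This is an illustration under stated assumptions: the `HiddenOffer` type and `resolve_counteroffer` function are hypothetical names, and the rule that the higher total wins a counteroffer is assumed from the tabletop game rather than taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiddenOffer:
    """A face-down money offer (hypothetical type for illustration)."""

    cards: tuple[int, ...]  # card values; 0-value cards are allowed

    def visible_count(self) -> int:
        # What the opponent can observe: only how many cards were offered.
        return len(self.cards)

    def true_value(self) -> int:
        # What only the offering player knows: the actual total.
        return sum(self.cards)


def resolve_counteroffer(challenger: HiddenOffer, defender: HiddenOffer) -> str:
    """Compare two face-down offers once revealed.

    Assumption: the higher total wins; a tie is reported and left to the caller.
    """
    if challenger.true_value() > defender.true_value():
        return "challenger"
    if defender.true_value() > challenger.true_value():
        return "defender"
    return "tie"


# A bluff: four cards look like a large offer but are worth only 10.
bluff = HiddenOffer(cards=(0, 0, 0, 10))
honest = HiddenOffer(cards=(50,))
print(bluff.visible_count(), bluff.true_value())  # 4 10
print(resolve_counteroffer(bluff, honest))        # defender
```

The bluff only pays off if the opponent accepts on the strength of the card count alone; once a counteroffer is revealed, the padded offer loses to a single higher card.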

Artifacts (3)

artifact

EconomyAgent (resource-economy code agent)
about
Conservative agent that caps spending, avoids enriching leading opponents, and bluffs only when safe.
SetRaceAgent (quartet-chasing code agent)
about
Greedy agent that ignores cost and bids aggressively on near-complete sets.
TrackerAgent (perfect-information greedy code agent)
about
Deterministic agent that tracks revealed cards and opponent wealth, bids just above budgets, and uses buy-right for quartet-completing cards.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CATTLE TRADE is a step toward evaluating agentic competence under more realistic conditions of strategic interaction · claim · 0.785
Positioning of the benchmark.
CATTLE TRADE concentrates auctions, hidden-offer TCs, bluffing, and resource management into a single environment. · claim · 0.740
Distinctive integration of multiple pressures.
Safety benchmarks · concept · 0.739
Evaluation framework whose validity is questioned by presence of eval awareness.
HELM Benchmark · method · 0.737
Existing alignment benchmark mentioned as relevant but insufficient for measuring intrinsic contemplative alignment.
Werewolf benchmark · framework · 0.731
LLM benchmark on the communication game Werewolf, cited.
consciousness benchmarks · concept · 0.679
Benchmarks designed to evaluate AI consciousness, which the paper argues are vulnerable to eval awareness inflation.
Do LLM failures in CATTLE TRADE reflect genuinely hard strategic problems or errors that novice humans also avoid? · question · 0.674
Open question about benchmarking against human players to calibrate difficulty.
AILuminate Benchmark · method · 0.669
Comprehensive AI safety benchmark evaluating resistance to harmful prompts across hazard categories; used in Experiment 1.