framework
archived
framework:cattle-trade-benchmark
CATTLE TRADE benchmark
A multi-agent benchmark integrating auctions, hidden-offer trade challenges, bluffing, bargaining, and resource management over 50-60 turns with four players, evaluating LLMs and code agents.
Neighborhood — ranked by edge-count
Methods (10)
method
- canonical auction mode (implements): Auction mode with iterative call rounds in which all non-auctioneer players submit bids simultaneously, faithful to the tabletop rules.
- scratchpad mechanism (implements): Free-text memory buffer updated each turn via an additional model call and included in subsequent observations under 'YOUR NOTES'.
- Scratchpad memory mechanism (implements): Agent's personal buffer, updated after its own turn via an extra model call and fed back into observations.
- fast auction mode (implements): Auction mode with a single sealed bid per player.
- legacy auction mode (implements): Auction mode with sequential bidding.
- TrueSkill (implements): Bayesian skill rating system used for competitive ranking in CATTLE TRADE.
- TrueSkill rating system (implements): Bayesian skill rating system used to rank agents from game outcomes.
- Algorithm that finds the minimum-overpay combination of discrete money cards to meet a payment amount with no change given.
- full memory mode (implements): Agent configuration in which the scratchpad is maintained and recent game events are provided in observations.
- Structured JSON action interface (implements): Agents respond with JSON specifying exact card selections and amounts; includes a multi-stage fallback for errors.
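The minimum-overpay payment method listed above can be illustrated with a small subset-sum search. This is only a sketch under stated assumptions, not the benchmark's own code: the function name `min_overpay_payment`, the example hand values, and the tie-break (prefer fewer cards when two combinations reach the same total) are all hypothetical.

```python
from typing import Dict, List, Optional


def min_overpay_payment(hand: List[int], amount: int) -> Optional[List[int]]:
    """Pick cards whose total is the smallest sum >= amount (no change given).

    Returns None if the hand cannot cover the amount. The tie-break toward
    fewer cards is an assumption made for this sketch.
    """
    if amount <= 0:
        return []
    if sum(hand) < amount:
        return None

    # best[total] = shortest card list found so far that sums to exactly total
    best: Dict[int, List[int]] = {0: []}
    for card in hand:
        if card == 0:
            continue  # zero-value cards never help reach a payment
        # Iterate over a snapshot so each card is used at most once.
        for total, cards in list(best.items()):
            new_total = total + card
            candidate = cards + [card]
            if new_total not in best or len(candidate) < len(best[new_total]):
                best[new_total] = candidate

    # Smallest reachable total that still covers the amount.
    target = min(t for t in best if t >= amount)
    return sorted(best[target])


# Hypothetical hand: paying 65 forces an overpayment of 5.
print(min_overpay_payment([0, 10, 10, 50, 100], 65))
```

The search enumerates reachable totals rather than subsets, so its cost grows with the number of distinct totals, which stays small for a hand of a few discrete denominations.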
Concepts (7)
concept
- auction (about): Competitive bidding mechanism in the game where players vie for animal cards.
- Payments use fixed denominations; no change given, forcing overpayment and resource constraint management.
- bluffing (about): Deceptive strategy using 0-value money cards in face-down offers to induce opponent acceptance without revealing the true offer value.
- Kuhhandel (You're Bluffing!) tabletop game (associated_with): Original 3–5 player card game by Rüdiger Koltze (Ravensburger, 1985) involving auctions, hidden offers, and bluffing, which CATTLE TRADE adapts.
- Bilateral bargaining with face-down money offers, enabling bluffing via 0-value cards and information asymmetry.
- Game condition where players do not know the exact money values held by opponents, only counts.
- resource management (about): Allocating discrete money cards and animal holdings over many turns to maximize final score.
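The information asymmetry behind bluffing, described in the concepts above, can be made concrete with a small data structure. This is a hedged sketch: the class `FaceDownOffer`, its `public_view` method, and the card values are hypothetical names and numbers chosen for illustration, not part of the benchmark's interface. It assumes only what the entries state, namely that an opponent sees how many cards are offered but not their values.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FaceDownOffer:
    """A face-down money offer; only the card count is visible to the opponent."""

    cards: Tuple[int, ...]  # private money card values, may include 0-value cards

    def total(self) -> int:
        # True value of the offer, known only to the player who made it.
        return sum(self.cards)

    def public_view(self) -> Dict[str, int]:
        # What the opponent observes: a count, never the values.
        return {"num_cards": len(self.cards)}


# A bluff padded with 0-value cards looks identical to a strong offer.
bluff = FaceDownOffer((0, 0, 10))
strong = FaceDownOffer((50, 100, 200))

print(bluff.public_view() == strong.public_view())  # same observation
print(bluff.total(), strong.total())                # very different values
```

Because both offers produce the same public view, an opponent deciding whether to accept has to reason about the offerer's likely holdings rather than the offer itself.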
Artifacts (3)
artifact
- Conservative agent capping spending, avoiding enriching leading opponents, bluffing only when safe.
- Greedy agent ignoring cost, bidding aggressively on near-complete sets.
- Deterministic agent that tracks revealed cards and opponent wealth, bids just above budgets, and uses buy-right for quartet-completing cards.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- positioning of the benchmark
- Distinctive integration of multiple pressures.
- Evaluation framework whose validity is questioned by presence of eval awareness.
- Existing alignment benchmark mentioned as relevant but insufficient for measuring intrinsic contemplative alignment
- LLM benchmark on the communication game Werewolf, cited.
- Benchmarks designed to evaluate AI consciousness, which the paper argues are vulnerable to eval awareness inflation.
- Do LLM failures in CATTLE TRADE reflect genuinely hard strategic problems or errors that novice humans also avoid? (question, 0.674): Open question about benchmarking against human players to calibrate difficulty.
- Comprehensive AI safety benchmark evaluating resistance to harmful prompts across hazard categories; used in Experiment 1