finding

active

finding:trackeragent-outperforms-six-of-seven-tested-llms

TrackerAgent outperforms six of seven tested LLMs

In the 98-game slice, TrackerAgent had a higher win rate or TrueSkill than all LLMs except Gemini 3 Flash.

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Neighborhood — ranked by edge-count

Claims (2)

claim

The code-agent ordering (TrackerAgent > SetRaceAgent > EconomyAgent) shows information exploitation matters more than greedy quartet-chasing, which in turn outperforms conservative budgeting
supports
interpretation of what drives success among deterministic strategies
Two heuristic code agents outperform most tested LLMs
supports
author assertion that deterministic heuristics surpass many LLMs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
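
As a rough illustration of how such a list could be produced, the sketch below keeps entities whose cosine similarity to the target is at least 0.65 and that share no typed edge with it. It is a minimal sketch, not this system's implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as a set of unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest first.

    embeddings:  dict mapping entity id -> vector (hypothetical storage format)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data, invented for illustration only.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],   # close to "a", no typed edge: kept
    "c": [0.0, 1.0],   # orthogonal to "a": below threshold
    "d": [1.0, 0.05],  # close to "a" but already linked: excluded
}
toy_edges = {frozenset(("a", "d"))}
candidates = related_by_similarity("a", toy_embeddings, toy_edges)
```

Here `candidates` contains only `"b"`: `"c"` falls below the threshold and `"d"` is dropped because it already has a typed relation to `"a"`.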

SetRaceAgent outperforms five of seven tested LLMs · finding · 0.873
SetRaceAgent ranked above DS-v3.2, GPT5.4-N, Haiku, G2.5-FL, and EconomyAgent.
Two heuristic code agents (TrackerAgent and SetRaceAgent) outperform most tested LLMs. · claim · 0.830
Calibration point: conditional logic can beat cost-efficient LLMs in this setting.
Hardest composition for LLMs: two TrackerAgents (C2, C7), only G3-F still wins majority · finding · 0.791
Card-counting pressure compounds with multiple TrackerAgents.
TrackerAgent win rate=53.6% · finding · 0.787
TrackerAgent won more than half its games
TrackerAgent win rate 53.6% in 98 canonical games · finding · 0.768
TrackerAgent won over half of the combined-comp1 games.
TrackerAgent and SetRaceAgent have TC tightness τ ≈ 0.2–0.25, looser counters · finding · 0.757
Code agents trade bargaining precision for acquisition pressure.
Card-counting heuristics suffice to outperform most LLMs tested. · claim · 0.753
TrackerAgent's second-place ranking calibrates the benchmark and highlights LLM shortcomings.
Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. · quote · 0.750
Abstract sentence summarising performance and failures.