claim

active

claim:two-heuristic-code-agents-trackeragent-and-setraceagent-outperform-most-tested-llms

Two heuristic code agents (TrackerAgent and SetRaceAgent) outperform most tested LLMs.

Serves as a calibration point: in this setting, simple conditional logic can beat cost-efficient LLMs.

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Neighborhood — ranked by edge-count

Findings (2)

finding

TrackerAgent TrueSkill μ=28.7 ± 3.6, 53.6% win rate
supports
Best code agent outperforming six of seven LLMs.
SetRaceAgent TrueSkill μ=27.3 ± 3.3
supports
Fourth-highest TrueSkill rating.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two heuristic code agents outperform most tested LLMs · claim · 0.868
author assertion that deterministic heuristics surpass many LLMs
Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. · quote · 0.851
Abstract sentence summarising performance and failures.
SetRaceAgent outperforms five of seven tested LLMs · finding · 0.838
SetRaceAgent ranked above DS-v3.2, GPT5.4-N, Haiku, G2.5-FL, and EconomyAgent.
TrackerAgent outperforms six of seven tested LLMs · finding · 0.830
In the 98-game slice, TrackerAgent had a higher win rate or TrueSkill than all LLMs except Gemini 3 Flash.
The code-agent ordering (TrackerAgent > SetRaceAgent > EconomyAgent) shows information exploitation matters more than greedy quartet-chasing, which in turn outperforms conservative budgeting · claim · 0.807
interpretation of what drives success among deterministic strategies
Code-agent ordering: TrackerAgent > SetRaceAgent > EconomyAgent · finding · 0.799
information exploitation outranks greedy quartet-chasing, which outranks conservative budgeting
Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations · claim · 0.791
discussion of potential confounds
Card-counting heuristics suffice to outperform most LLMs tested · claim · 0.786
TrackerAgent's second-place ranking calibrates the benchmark and highlights LLM shortcomings.
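
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It is not the implementation behind this page: the function names, the `embeddings` mapping, and the `typed_edges` set of unordered ID pairs are all hypothetical, and only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but not linked to it.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) for existing typed relations
    threshold:   0.65, the cutoff shown on this page
    """
    target_vec = embeddings[target_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list above
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks; anything already joined to the target by a typed edge is excluded regardless of its score.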