Two heuristic code agents outperform most tested LLMs

Author assertion that deterministic heuristics surpass many LLMs.

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Neighborhood — ranked by edge-count

Findings (3)

finding

TrackerAgent outperforms six of seven tested LLMs
supports
In the 98-game slice, TrackerAgent had a higher win rate or TrueSkill than all LLMs except Gemini 3 Flash.
Gemini 3 Flash wins 72.9% of 98 canonical games
supports
G3-F achieved a win rate of 72.9% in the combined-comp1 98-game slice.
SetRaceAgent outperforms five of seven tested LLMs
supports
SetRaceAgent ranked above DS-v3.2, GPT5.4-N, Haiku, G2.5-FL, and EconomyAgent.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two heuristic code agents (TrackerAgent and SetRaceAgent) outperform most tested LLMs. · claim · 0.868
Calibration that conditional logic can beat cost-efficient LLMs in this setting.
Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. · quote · 0.863
Abstract sentence summarising performance and failures.
Card-counting heuristics suffice to outperform most LLMs tested. · claim · 0.817
TrackerAgent's second-place ranking calibrates the benchmark and highlights LLM shortcomings.
Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations. · claim · 0.789
discussion of potential confounds
Conditional logic already suffices where LLMs still fail, as code agents avoid systematic failures. · claim · 0.766
contrast between rule-based and LLM reasoning
The three code agents never overbid. · finding · 0.765
Deterministic heuristics avoid the overbidding failure mode entirely.
LLMs can predict their own responses more accurately than external observers, implying privileged internal knowledge. · finding · 0.737
Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness
Li et al. 2024: larger LLMs outperform smaller ones at distinguishing self-related from non-self-related properties on self-awareness benchmarks. · finding · 0.733
Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1