quote

active

quote:two-heuristic-code-agents-outperform-most-tested-llms-and-behavioural-traces-surface-recurring-llm-failure-modes-including-overbidding-self-bidding-bankrupt-tc-initiation-and-weak-opponent-state-adaptation

Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation.

Abstract sentence summarising performance and failures.

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation that never appear in code agents. · claim · 0.917
LLMs exhibit systematic errors that deterministic logic avoids.
Two heuristic code agents outperform most tested LLMs · claim · 0.863
Author assertion that deterministic heuristics surpass many LLMs.
Two heuristic code agents (TrackerAgent and SetRaceAgent) outperform most tested LLMs. · claim · 0.851
Calibration that conditional logic can beat cost-efficient LLMs in this setting.
Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations · claim · 0.834
Discussion of potential confounds.
Conditional logic already suffices where LLMs still fail, as code agents avoid systematic failures · claim · 0.821
Contrast between rule-based and LLM reasoning.
Overbid frequency, self-bidding rate, bankrupt-initiation patterns, and context-dependent offer calibration are failure modes invisible to both static evaluations and aggregate rankings like Elo · claim · 0.814
Key claim about the benchmark's unique diagnostic value.
Card-counting heuristics suffice to outperform most LLMs tested. · claim · 0.809
TrackerAgent's second-place ranking calibrates the benchmark and highlights LLM shortcomings.
Do LLM failures in CATTLE TRADE reflect genuinely hard strategic problems or errors that novice humans also avoid? · question · 0.787
Open question about benchmarking against human players to calibrate difficulty.