claim
active
claim:behavioural-traces-surface-recurring-llm-failure-modes-including-overbidding-self-bidding-bankrupt-tc-initiation-and-weak-opponent-state-adaptation-that-never-appear-in-code-agents
Behavioural traces surface recurring LLM failure modes, including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation, that never appear in code agents.
LLMs exhibit systematic errors that deterministic logic avoids.
Source paper
extracted_from · (2026) · Robert Müller · Clemens Müller
Neighborhood — ranked by edge-count
Findings (6)
finding
- G2.5-FL raises its own bid in over three-quarters of auction rounds.
- Deterministic heuristics avoid the overbidding failure mode entirely.
- highest overbid frequency observed
- failure to condition action choice on resource state
- G2.5-FL self-bid rate = 78.5% · supports · highest self-bid rate among all agents
- G2.5-FL has the highest overbid frequency among all agents.
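As a rough illustration of how per-agent rates like those in the findings above could be derived from a behavioural trace, here is a minimal sketch. It is not the paper's implementation: the `Bid` record, the `failure_rates` function, and the sample `trace` are hypothetical, and the two definitions it uses are assumptions (a self-bid is a raise by the agent that already holds the leading bid; an overbid is a bid above the agent's available balance). Rates are counted per auction round in which the agent placed at least one bid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    """One bid event in a trace (hypothetical schema, not from the paper)."""
    round_id: int  # auction round the bid belongs to
    agent: str     # name of the bidding agent
    amount: int    # amount offered
    balance: int   # money the agent actually holds when bidding


def failure_rates(bids: list[Bid], agent: str) -> dict[str, float]:
    """Share of the agent's auction rounds containing a self-bid or an overbid.

    Assumes bids are listed in chronological order and that every bid
    becomes the new leading bid of its round.
    """
    rounds: dict[int, list[Bid]] = {}
    for bid in bids:
        rounds.setdefault(bid.round_id, []).append(bid)

    participated = self_bid_rounds = overbid_rounds = 0
    for round_bids in rounds.values():
        if not any(b.agent == agent for b in round_bids):
            continue  # agent placed no bid in this round
        participated += 1
        leader = None
        saw_self_bid = saw_overbid = False
        for b in round_bids:
            if b.agent == agent:
                if leader == agent:
                    saw_self_bid = True  # raised while already leading
                if b.amount > b.balance:
                    saw_overbid = True   # offered more than it can pay
            leader = b.agent
        self_bid_rounds += saw_self_bid
        overbid_rounds += saw_overbid

    if participated == 0:
        return {"self_bid_rate": 0.0, "overbid_rate": 0.0}
    return {
        "self_bid_rate": self_bid_rounds / participated,
        "overbid_rate": overbid_rounds / participated,
    }


# Invented sample trace, for illustration only.
trace = [
    Bid(1, "A", 10, 50), Bid(1, "B", 20, 40), Bid(1, "A", 30, 50), Bid(1, "A", 40, 50),
    Bid(2, "A", 10, 50), Bid(2, "B", 60, 40),
    Bid(3, "B", 10, 40),
]
rates_a = failure_rates(trace, "A")
rates_b = failure_rates(trace, "B")
```

A rate computed this way counts events only. It cannot tell whether a self-bid was an error or a deliberate aggressive move, which is the ambiguity raised in the questions on this page.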
Questions (2)
question
- Ambiguity in interpreting the self-bidding metric: from a single trace, one cannot distinguish an error from an aggressive strategy.
- It remains untested whether the specific LLM failures observed in CATTLE TRADE extend beyond this game.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Abstract sentence summarising performance and failures.
- key claim about the benchmark's unique diagnostic value
- Concrete failure signatures extracted from traces.
- Replication of Fontana et al. 2025 findings in the paper's own Experiment 2 baseline condition
- discussion of potential confounds
- Do LLM failures in CATTLE TRADE reflect genuinely hard strategic problems or errors that novice humans also avoid? · question · 0.772 · Open question about benchmarking against human players to calibrate difficulty.
- Primary conclusion of the study based on temporal permutation analysis failing all three criteria.
- Conditional logic already suffices where LLMs still fail, as code agents avoid systematic failures · claim · 0.766 · contrast between rule-based and LLM reasoning
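The selection rule for this list (cosine similarity of at least 0.65 and no typed edge to the current entity) can be sketched roughly as follows. This is an illustrative assumption about how such a filter might work, not the site's actual code: the function names, the toy embeddings, and the entity keys are all hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(
    target: str,
    embeddings: dict[str, list[float]],
    typed_neighbours: set[str],
    threshold: float = 0.65,
) -> list[tuple[str, float]]:
    """Entities at or above the threshold that have no typed edge to target,
    sorted by descending similarity."""
    hits = []
    for name, vector in embeddings.items():
        if name == target or name in typed_neighbours:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional embeddings, invented for illustration.
toy_embeddings = {
    "claim": [1.0, 0.0],
    "linked-finding": [1.0, 0.0],  # very similar, but already has a typed edge
    "open-question": [1.0, 1.0],   # cosine about 0.707, no typed edge
    "unrelated": [0.0, 1.0],       # cosine 0.0
}
candidates = similar_untyped("claim", toy_embeddings, {"linked-finding"})
```

In this toy data only `open-question` passes both conditions, mirroring how the entries above appear with a similarity score but no relation label.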