question

active

question:do-these-failure-modes-overbidding-self-bidding-bankrupt-initiation-generalise-to-other-economic-settings

Do these failure modes (overbidding, self-bidding, bankrupt initiation) generalise to other economic settings?

It remains untested whether the specific LLM failures observed in CATTLE TRADE extend beyond this game.

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Neighborhood — ranked by edge-count

Claims (1)

claim

Behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation that never appear in code agents.
gates
LLMs exhibit systematic errors that deterministic logic avoids.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Do these failure modes generalise to other economic settings? · question · 0.874
open question from discussion
Overbid frequency, self-bidding rate, bankrupt-initiation patterns, and context-dependent offer calibration are failure modes invisible to both static evaluations and aggregate rankings like Elo · claim · 0.849
key claim about the benchmark's unique diagnostic value
Overbidding, self-bidding spirals, and undisciplined bluffing characterise failure. · claim · 0.813
Concrete failure signatures extracted from traces.
Does a high self-bidding rate reflect a failure to detect non-competitive contexts or a deliberate escalation? · question · 0.767
Ambiguity in interpreting the self-bidding metric: a single trace cannot distinguish error from aggressive strategy.
Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. · quote · 0.766
Abstract sentence summarising performance and failures.
Do LLM failures in CATTLE TRADE reflect genuinely hard strategic problems or errors that novice humans also avoid? · question · 0.742
Open question about benchmarking against human players to calibrate difficulty.
Out-of-distribution collapse in foundation models is the same failure mode as Levin's regeneration failures and Alexander's lifeless buildings: loss of multi-scale coherence under stress. · claim · 0.732
Do the documented failures reflect fundamental limitations or a cost-efficiency tradeoff of smaller models? · question · 0.726
question for future work on frontier models
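
The "Related by similarity" selection rule (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are hypothetical and do not describe how this page is actually generated.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings:  dict mapping entity id -> vector (hypothetical storage)
    typed_edges: set of frozenset({id_a, id_b}) pairs (hypothetical storage)
    """
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # an entity is not its own neighbour
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: "linked" is similar but already has a typed edge; "far" is dissimilar.
toy_embeddings = {
    "q": [1.0, 0.0],
    "near": [0.9, 0.1],
    "linked": [1.0, 0.05],
    "far": [0.0, 1.0],
}
toy_edges = {frozenset(("q", "linked"))}

candidates = related_by_similarity("q", toy_embeddings, toy_edges)
print(candidates)
```

In this toy run only `near` survives both filters. Candidates returned this way are exactly the ones the page flags for review as possible new edges or unrecognized duplicates.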