finding

active

finding:hardest-composition-for-llms-two-trackeragents-c2-c7-only-g3-f-still-wins-majority

Hardest composition for LLMs: two TrackerAgents (C2, C7), only G3-F still wins majority

Card-counting pressure compounds with multiple TrackerAgents.

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

TrackerAgent outperforms six of seven tested LLMs · finding · 0.791
In the 98-game slice, TrackerAgent had a higher win rate or TrueSkill than all LLMs except Gemini 3 Flash.
Li et al. 2024: larger LLMs outperform smaller ones at distinguishing self-related from non-self-related properties on self-awareness benchmarks · finding · 0.762
Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1.
All cases satisfying Criteria 1 and 2 (two out of three) originate from deeper transformer layers and/or the 2/3 layer of LLMs. · finding · 0.745
Consistent with literature that deeper layers encode semantic information and align with human brain activity.
Two heuristic code agents (TrackerAgent and SetRaceAgent) outperform most tested LLMs. · claim · 0.744
A calibration point showing that conditional logic can beat cost-efficient LLMs in this setting.
Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. · quote · 0.737
Abstract sentence summarising performance and failures.
Card-counting heuristics suffice to outperform most LLMs tested. · claim · 0.736
TrackerAgent's second-place ranking calibrates the benchmark and highlights LLM shortcomings.
SetRaceAgent outperforms five of seven tested LLMs · finding · 0.733
SetRaceAgent ranked above DS-v3.2, GPT5.4-N, Haiku, G2.5-FL, and EconomyAgent.
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family. · finding · 0.726
Establishes generalizability of the core difficulty-boundary finding across model families.