Verbose reasoning chains are not required for strong play.

G3-F uses about 275k tokens per game while G3.1-FL uses about 14.8k, yet both rank among the top models, so token volume alone does not predict strategic quality.

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Neighborhood — ranked by edge-count

Findings (1)

finding

Token usage varies roughly 20× across models, from ~14,800 tokens per game (G3.1-FL) to ~275,000 (G3-F).
supports
Reasoning verbosity does not predict strategic strength: both top and weak models span a wide range of token usage.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Performative chain-of-thought is real; verbalized output does not equal internal state. · claim · 0.763
Does chain-of-thought text faithfully reveal a model's internal reasoning process, or does it constitute performative theater? · question · 0.746
Central research question motivating the paper.
Chain-of-thought reasoning improves the transparency and performance of AI decision making in harmlessness evaluation. · claim · 0.730
CoT improves accuracy on HHH evals and makes the decision process legible.
Stating and proving that answers to questions and other statements are responsive seems to require a substantially larger logical apparatus than merely proving that the answers are truthful. · claim · 0.728
Claim about the difficulty of responsiveness verification.
A small number of high-quality human demonstrations of chain-of-thought reasoning could be used to improve and focus performance. · hypothesis · 0.726
Section 6 mentions high-quality human demos could improve natural language feedback.
Commonsense reasoning shows uniform but weaker anchoring (S ≈ −2.15). · finding · 0.726
Task-specific comparison.
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022) · concept · 0.724
Foundational paper on CoT prompting cited as basis for reasoning LLM training.
Complement syntax and mental state verb comprehension abilities crucial for human ToM development are not significantly represented in LLMs, revealing fundamental discrepancies between natural and artificial intelligence regarding mind development. · claim · 0.721
Derived from the finding that linguistic span focusing on complements/MSV yields no significant IIT estimate changes.