finding

active

finding:token-usage-varies-roughly-20-across-models-from-14-800-g3-1-fl-to-275-000-g3-f-per-game

Token usage varies roughly 20× across models, from ~14,800 (G3.1-FL) to ~275,000 (G3-F) per game

Reasoning verbosity does not predict strategic strength: strong and weak models alike span a wide range of token usage.
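As a quick arithmetic check on the "roughly 20×" figure, the sketch below computes the spread from the approximate numbers quoted in this entry. The per-call values are the ones listed among the related findings; the variable names and the `spread` helper are illustrative, not part of the source.

```python
# Approximate figures quoted in this entry (not re-measured here).
tokens_per_game = {"G3-F": 275_000, "G3.1-FL": 14_800}
tokens_per_call = {"G3-F": 1_500, "G3.1-FL": 80}


def spread(values):
    """Ratio between the largest and smallest value in a mapping."""
    return max(values.values()) / min(values.values())


game_ratio = spread(tokens_per_game)
call_ratio = spread(tokens_per_call)
print(f"per game: {game_ratio:.1f}x, per call: {call_ratio:.1f}x")
```

Both ratios come out a little under 19×, consistent with the rounded "roughly 20×" in the title.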

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Neighborhood — ranked by edge-count

Claims (1)

claim

Verbose reasoning chains are not required for strong play.
supports
G3-F uses ~275,000 tokens per game while G3.1-FL uses ~14,800, yet both rank among the top models; token volume alone does not predict strategic quality.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
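A minimal sketch of how such a candidate list could be produced, assuming each entity has an embedding vector and typed edges are stored as ID pairs. The function names, the data layout, and the toy vectors are hypothetical; only the 0.65 cosine threshold and the "no typed edge" condition come from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with the target."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target by a typed relation.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    # Highest similarity first, as in the list on this page.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data (hypothetical IDs and 2-D vectors, for illustration only).
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [0.8, 0.6],  # similar to finding_a, no typed edge
    "finding_c": [0.0, 1.0],  # orthogonal to finding_a, below threshold
    "claim_d": [1.0, 0.1],    # similar, but already linked by a typed edge
}
typed_edges = {("finding_a", "claim_d")}
result = related_by_similarity("finding_a", embeddings, typed_edges)
print(result)
```

Only `finding_b` survives both filters here: `finding_c` falls below the threshold and `claim_d` is excluded because it already has a typed edge.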

G3.1-FL generates ~14,800 completion tokens per game · finding · 0.845
Very efficient token usage with strong play.
G3-F completion tokens ~1,500 per call, G3.1-FL ~80 per call · finding · 0.823
Verbose reasoning not required for strong play.
G3-F win rate=72.9% in 98 canonical games · finding · 0.747
Gemini 3 Flash won nearly 3/4 of its games.
For all three SAEs (1M, 4M, 34M), average active features per token <300, and reconstruction variance explained ≥65%. · finding · 0.747
Basic SAE performance metrics.
G3-F mixed-format win rate=67.9% over 28 games · finding · 0.738
Performance in mixed games against three code agents.
G3.1-FL TrueSkill μ=28.0 ± 2.9, 44.9% win rate · finding · 0.724
Second-best LLM, competitive with TrackerAgent.
Expert iteration trained on 41,290 examples (44.7 million tokens) across 4 rounds · finding · 0.724
Training scale for second stage.
SDF training used 115.6 million tokens (rank-64 LoRA, learning rate 1e-4) · finding · 0.723
Training details for first stage.