G3-F completion tokens ~1,500 per call, G3.1-FL ~80 per call

Verbose reasoning is not required for strong play.

Source paper

extracted_from

(2026) · Robert Müller · Clemens Müller

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

G3.1-FL generates ~14,800 completion tokens per game · finding · 0.880
Very efficient token usage with strong play.
Token usage varies roughly 20× across models, from ~14,800 (G3.1-FL) to ~275,000 (G3-F) per game · finding · 0.823
Reasoning verbosity does not predict strategic strength: both top and weak models span a wide range of token usage.
G3.1-FL TrueSkill μ=28.0 ± 2.9, 44.9% win rate · finding · 0.771
Second-best LLM, competitive with TrackerAgent.
G3-F overbid rate=0.00% · finding · 0.747
Gemini 3 Flash never overbid.
G3-F TrueSkill μ=30.1 ± 3.3, 72.9% wins, median score 5,250 on combined-comp1 slice (n=98 canonical games) · finding · 0.739
Top LLM performance with high win rate and large score.
G3-F conditions TC offers on opponent wealth and game context, e.g., 0-value bluffs against bankrupt opponents · finding · 0.730
Sophisticated bluff calibration.
G3-F score std=4,026 on median 5,250 · finding · 0.730
High variance in scores due to multiplicative scoring sensitivity.
G3.1-FL buy-right rate 31.3% · finding · 0.728
High buy-right usage among LLMs.
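
The token and score figures quoted on this page can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary and variable names are made up for this example, and the "implied calls per game" figure is a derived quantity that assumes the per-call and per-game averages describe the same set of games, which the source does not state.

```python
# Figures copied from the findings on this page; all names here are illustrative.
tokens_per_game = {"G3-F": 275_000, "G3.1-FL": 14_800}
tokens_per_call = {"G3-F": 1_500, "G3.1-FL": 80}


def ratio(values, high="G3-F", low="G3.1-FL"):
    """Return how many times larger the `high` model's value is than the `low` model's."""
    return values[high] / values[low]


# Both ratios sit just under the "roughly 20x" quoted above.
per_game_ratio = ratio(tokens_per_game)
per_call_ratio = ratio(tokens_per_call)

# Assumption: dividing per-game by per-call tokens approximates calls per game.
implied_calls = {
    model: tokens_per_game[model] / tokens_per_call[model]
    for model in tokens_per_game
}

# Spread of G3-F scores relative to the median (std 4,026 on median 5,250).
score_spread = 4_026 / 5_250

print(f"per-game ratio: {per_game_ratio:.2f}")
print(f"per-call ratio: {per_call_ratio:.2f}")
print({model: round(calls) for model, calls in implied_calls.items()})
print(f"score std / median: {score_spread:.3f}")
```

Under these assumptions the per-game and per-call ratios agree (about 18.6 and 18.75), and both models come out at a similar number of calls per game, so the gap in per-game token usage would stem from per-call verbosity rather than from a different number of calls.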