G3.1-FL generates ~14,800 completion tokens per game

G3.1-FL uses tokens very efficiently while still playing strongly.

Source paper

extracted_from

(2026) · Robert Müller · Clemens Müller

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

G3-F completion tokens ~1,500 per call, G3.1-FL ~80 per call · finding · 0.880
Verbose reasoning is not required for strong play.
Token usage varies roughly 20× across models, from ~14,800 (G3.1-FL) to ~275,000 (G3-F) per game · finding · 0.845
Reasoning verbosity does not predict strategic strength: both top and weak models span a wide range of token usage.
G3.1-FL wins 50.0% of 28 mixed games · finding · 0.783
Half the games won against code agents.
G3-F win rate = 72.9% in 98 canonical games · finding · 0.767
Gemini 3 Flash won nearly 3/4 of its games.
G3.1-FL TrueSkill μ = 28.0 ± 2.9, 44.9% win rate · finding · 0.762
Second-best LLM, competitive with TrackerAgent.
G3-F mixed-format win rate = 67.9% over 28 games · finding · 0.762
Performance in mixed games against three code agents.
G3-F wins 67.9% of 28 mixed games (vs three code agents) · finding · 0.760
Robust performance against algorithmic baselines.
G3-F TrueSkill μ = 30.1 ± 3.3, 72.9% wins, median score 5,250 on combined-comp1 slice (n = 98 canonical games) · finding · 0.744
Top LLM performance with high win rate and large score.
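
The token figures quoted in the entries above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it uses the approximate ("~") values as listed, and the "implied calls per game" is a derived estimate (per-game tokens divided by per-call tokens) that the source does not state. The variable and function names are hypothetical.

```python
# Approximate completion-token figures as quoted in the entries above.
TOKENS_PER_GAME = {"G3.1-FL": 14_800, "G3-F": 275_000}
TOKENS_PER_CALL = {"G3.1-FL": 80, "G3-F": 1_500}


def per_game_ratio(high: str, low: str) -> float:
    """Ratio of per-game completion tokens between two models."""
    return TOKENS_PER_GAME[high] / TOKENS_PER_GAME[low]


def implied_calls_per_game(model: str) -> float:
    """Derived estimate (not stated in the source): per-game / per-call tokens."""
    return TOKENS_PER_GAME[model] / TOKENS_PER_CALL[model]


ratio = per_game_ratio("G3-F", "G3.1-FL")
calls = {model: implied_calls_per_game(model) for model in TOKENS_PER_GAME}

print(f"G3-F / G3.1-FL per-game tokens: {ratio:.1f}x")
for model, n in calls.items():
    print(f"{model}: about {n:.0f} calls per game (derived)")
```

Under these rounded inputs the per-game ratio comes out a little under 19×, consistent with the "roughly 20×" wording, and both models land near the same implied number of calls per game, which suggests the gap is driven by tokens per call rather than by the number of calls.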