GPT5.4-N wins 14.3% of mixed games

Similarly poor against code agents.

Source paper

extracted_from

(2026) · Robert Müller · Clemens Müller

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

G3.1-FL wins 50.0% of 28 mixed games · finding · 0.845
Half the games won against code agents.
DS-v3.2 wins 10.7% of mixed games · finding · 0.841
Poor performance against code agents.
G3-F wins 67.9% of 28 mixed games (vs three code agents) · finding · 0.833
Robust performance against algorithmic baselines.
G3-F mixed-format win rate=67.9% over 28 games · finding · 0.802
Performance in mixed games against three code agents.
Haiku wins 7.1% of mixed games · finding · 0.802
Very low win rate against code agents.
GPT5.4-N overbid rate=0.47% · finding · 0.797
Overbid rate for GPT-5.4 Nano.
Gemini 3 Flash wins 67.9% of its 28 mixed-format games against code agents · finding · 0.768
In the 172-game exp2 slice, G3-F has the highest LLM win rate against deterministic baselines.
G3-F win rate=72.9% in 98 canonical games · finding · 0.752
Gemini 3 Flash won nearly 3/4 of its games.
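
The neighborhood rule stated above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the system's actual implementation: the `Neighbor` record, the `untyped_neighbors` function, and the two "hypothetical" entries are assumptions introduced here, while the two real titles and scores are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Neighbor:
    # Hypothetical record shape; the real schema is not given in the source.
    title: str
    cosine: float
    has_typed_edge: bool


def untyped_neighbors(candidates, threshold=0.65):
    """Keep candidates at or above the cosine threshold that have no
    typed edge to the current entity, ordered by descending similarity."""
    kept = [
        c for c in candidates
        if c.cosine >= threshold and not c.has_typed_edge
    ]
    return sorted(kept, key=lambda c: c.cosine, reverse=True)


candidates = [
    Neighbor("DS-v3.2 wins 10.7% of mixed games", 0.841, False),
    Neighbor("G3.1-FL wins 50.0% of 28 mixed games", 0.845, False),
    # The next two entries are invented purely to exercise the filter.
    Neighbor("hypothetical entity with a typed edge", 0.900, True),
    Neighbor("hypothetical low-similarity entity", 0.400, False),
]

result = untyped_neighbors(candidates)
for c in result:
    print(f"{c.title} · {c.cosine:.3f}")
```

Entries that survive the filter are the candidates for new edges or duplicate merges; near-identical titles with high scores, such as the several G3-F 67.9% findings, are the likeliest duplicates.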