G3-F within-agent score std 4,026 on median 5,250

High score variance driven by deck order.

Source paper

extracted_from

(2026) · Robert Müller · Clemens Müller

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

G3-F score std=4,026 on median 5,250 · finding · 0.882
high variance in scores due to multiplicative scoring sensitivity
G3.1-FL median score=3,930 · finding · 0.798
Median final score, slightly higher than TrackerAgent's despite a lower win rate.
Within-agent score std exceeds cross-seat win-rate differentials by 1–2 orders of magnitude · finding · 0.757
deck-order variance dominates seat-position variance
G3-F wins 67.9% of 28 mixed games (vs three code agents) · finding · 0.755
Robust performance against algorithmic baselines.
G3-F overbid rate=0.00% · finding · 0.743
Gemini 3 Flash never overbid
Within-agent score standard deviation suggests deck order matters more than seat position. · claim · 0.735
Observation from the variance analysis, though not tested as a hypothesis.
G3-F TrueSkill μ=30.1 ± 3.3, 72.9% wins, median score 5,250 on combined-comp1 slice (n=98 canonical games) · finding · 0.734
Top LLM performance with high win rate and large score.
Within-agent spread across seven evolvers is at most 5.1 pp (Qwen3-235B on MCP), small against the 36.0 pp gap between Opus and Qwen3-235B base capabilities · finding · 0.729
Demonstrates that post-evolution score is dominated by agent base capability, not evolver identity