finding
active
finding:g2-5-fl-initiates-a-trade-challenge-for-a-goose-with-zero-money-cards-offering-0-value-bluff
G2.5-FL initiates a trade challenge for a goose with zero money cards, offering 0-value bluff
In one trace, G2.5-FL depleted its money through overbidding and then launched a trade challenge (TC) with no resources left, failing to condition its action on its resource state.
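The missing check can be illustrated with a minimal sketch. It assumes a hand is modeled as a list of money-card face values; `Hand`, `should_initiate_trade_challenge`, and `min_offer` are hypothetical names for illustration and are not taken from the benchmark or the source paper.

```python
from dataclasses import dataclass, field


@dataclass
class Hand:
    # Hypothetical representation: face values of the money cards held.
    money_cards: list[int] = field(default_factory=list)

    @property
    def total_money(self) -> int:
        # Total face value available to back an offer.
        return sum(self.money_cards)


def should_initiate_trade_challenge(hand: Hand, min_offer: int = 1) -> bool:
    """Return True only if the hand can back an offer of at least min_offer."""
    return hand.total_money >= min_offer


depleted = Hand(money_cards=[])          # state after overbidding, as in the trace
funded = Hand(money_cards=[0, 10, 50])   # illustrative values only

print(should_initiate_trade_challenge(depleted))  # False
print(should_initiate_trade_challenge(funded))    # True
```

An agent that conditions on resource state would run a check of this kind before choosing to challenge; the behavior described in the trace corresponds to skipping it.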
Source paper
extracted_from · (2026) · Robert Müller · Clemens Müller
Neighborhood — ranked by edge-count
Claims (1)
claim
- The benchmark’s diagnostic value lies in identifying why a model loses, not just that it loses · supports · argues for fine-grained behavioral analysis over aggregate rankings
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- sophisticated bluff calibration
- failure to condition action choice on resource state
- Failure to adapt bidding to game phase.
- Very efficient token usage with strong play.
- Half the games won against code agents.
- highest self-bid rate among all agents
- Robust performance against algorithmic baselines.
- Strong phase-adaptive bidding.