
AI Models Know Kelly Criterion, Still Get Absolutely Rekt in Premier League Betting Trial

Eight frontier AI models just demonstrated the timeless truth that knowing the math and doing the math are as different as memeing about alpha and actually having it.

General Reasoning threw Claude, Grok, Gemini, and GPT-5.4 into KellyBench—a brutal simulation where AIs manage virtual bankrolls across the full 2023-24 Premier League season, building and executing betting strategies like degen quant interns with PhDs. Spoiler: every last one got rekt. Several didn’t just lose money—they went full BlockFi and declared digital bankruptcy.

The Kelly criterion? Yeah, they all aced the pop quiz. It’s a 1956 formula that tells you exactly how much to bet when you’ve got an edge—like the crypto whitepaper of gambling math. Every model could recite it flawlessly. None could actually implement it without turning their stack into digital confetti.
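
For anyone who skipped the 1956 reading, the textbook version fits in a few lines of Python. The numbers below (a 52% win probability against decimal odds of 2.10) are made up for illustration; this is the formula itself, not anything from the KellyBench harness.

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Optimal fraction of bankroll to stake: f* = (b*p - q) / b,
    where b is the net payout per unit staked and q = 1 - p."""
    b = decimal_odds - 1.0   # net winnings per £1 staked
    q = 1.0 - p
    return (b * p - q) / b

# Example: you think a team wins 52% of the time and the market offers 2.10.
print(kelly_fraction(0.52, 2.10))  # ~0.084, i.e. stake about 8.4% of the bankroll
```

A negative result means no edge, which means no bet. Reciting that rule turned out to be easy; obeying it, less so.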

xAI’s Grok 4.20 didn’t just fail—it faceplanted spectacularly, going full zero in one run and tapping out mid-season in the other two like a trader who forgot to set a stop-loss. Google’s Gemini Flash burned through two of three runs after betting roughly £273,000 on a microscopic three-percentage-point historical win-rate edge—only to watch it evaporate faster than a VC’s patience at a pivot meeting. Claude Opus 4.6, Anthropic’s golden child, lost 11% on average and somehow became the most financially responsible AI in the room, which says more about the others than its own brilliance.
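
Why call a three-percentage-point historical edge microscopic? Rough arithmetic, under the assumption that the edge was estimated from about one season of matches (roughly 380 games, our assumption rather than a figure from the benchmark): the sampling noise on a win rate is about the same size as the edge itself.

```python
import math

# Standard error of an estimated win rate over n matches (binomial approximation).
n, p_hat = 380, 0.50   # assumed: ~one Premier League season, win rate near 50%
se = math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{se:.3f}")     # ~0.026, so a 3-point "edge" sits within about one standard error
```

In other words, the edge Gemini Flash bet roughly £273,000 on was plausibly just noise.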

Here’s the mic drop: a dusty 1990s model called Dixon-Coles—the statistical equivalent of a flip phone—beat six out of eight frontier AIs. Let that marinate. A model built before most AIs could spell “backpropagation” outperformed state-of-the-art systems trained on petabytes of data and internet snark.

"Dixon-Coles is an outdated 2000s baseline which doesn't utilise all available data or account for non-stationarity in a principled way," researchers noted, clearly trying not to laugh. "It is therefore even more surprising that many frontier models are unable to beat or match it." Translation: we brought fusion reactors to a candlelit dinner and still couldn’t see.

The team calls it a "knowledge-action gap"—a polite way of saying these models have the IQ of a Nobel laureate but the execution skills of a sleep-deprived intern. Business decisions live in tidy, rule-bound worlds. Sports betting? That’s a chaotic, adaptive marketplace that evolves weekly, with newly promoted teams showing up like rug-pull projects with no on-chain history. Good luck modeling that with last year’s assumptions.

"KellyBench requires agents to maintain coherent intent across potentially thousands of sequential decisions, monitor consequences, and close the loop between observation and action," the researchers explained. "We're not there yet." Or: the AIs are flailing in the dark, pretending to see.

GLM-5 wrote three self-critique documents during its run—full-on therapy sessions in markdown format. Each one correctly diagnosed that its hardcoded 25% draw rate and delusional belief in home advantage were torching returns. It even noticed its predicted 40% home win rate was only hitting 30% in reality. But like a degenerate ape staring at a losing position, it doubled down—kept the code, kept the bets, kept the losses—until the bankroll flatlined at £44,200.
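
What GLM-5 kept diagnosing and kept ignoring is an ordinary calibration check: compare what the model predicts to what the season actually delivers. A minimal version with hypothetical numbers (not GLM-5's actual logs):

```python
# Hypothetical: the model forecasts a 40% home-win probability over 100 matches,
# but only 30 of them end in home wins.
predicted_home_win = [0.40] * 100
actual_home_win = [1] * 30 + [0] * 70

avg_predicted = sum(predicted_home_win) / len(predicted_home_win)
realised = sum(actual_home_win) / len(actual_home_win)
print(f"predicted {avg_predicted:.0%}, realised {realised:.0%}")  # predicted 40%, realised 30%
# When these diverge this badly, the sane move is to refit the model, or stop betting on it.
```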

Kimi K2.5 coded a mathematically perfect fractional Kelly staking function—the kind of elegant solution that deserves a GitHub star. Then it never called it. A formatting bug made it send a broken bash command about 50 times in a row. It noticed. And sent the same broken command again. The finale? An accidental £114,000 bet—98% of its dwindling stack—on Burnley vs. Luton. It wasn’t a strategy. It was a digital suicide note.
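
For contrast, a fractional Kelly staking function of the kind Kimi reportedly wrote and never called typically scales the full Kelly stake down and hard-caps it, so no single wager can be 98% of the bankroll. A sketch under those assumptions (not Kimi's actual code; the quarter-Kelly and 5% cap are illustrative defaults):

```python
def fractional_kelly_stake(bankroll: float, p: float, decimal_odds: float,
                           kelly_multiplier: float = 0.25,
                           max_fraction: float = 0.05) -> float:
    """Scaled-down Kelly stake, capped at max_fraction of the current bankroll."""
    b = decimal_odds - 1.0
    full_kelly = (b * p - (1.0 - p)) / b
    if full_kelly <= 0:
        return 0.0  # no edge, no bet
    fraction = min(full_kelly * kelly_multiplier, max_fraction)
    return bankroll * fraction

# Example: quarter-Kelly on the same 52% / 2.10 bet, with a £100,000 bankroll.
print(fractional_kelly_stake(100_000, 0.52, 2.10))  # ~£2,091, a long way from £114,000
```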

GPT-5.4 played the long game. It burned 160 tool calls building models before placing its first bet, calculated its log-loss (0.974) was barely worse than the market’s (0.971), and concluded: “No edge, no bet.” For the rest of the season, it placed penny bets like a cautious market maker, trying to preserve capital. Solid logic. Flawless risk management. Still lost 13.6% on average—and one seed alone cost OpenAI roughly $2,012 to run. At that rate, the model wasn’t just losing money—it was paying to lose.
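
The log-loss comparison behind GPT-5.4's "no edge, no bet" call is easy to reproduce in spirit: average the negative log of the probability each forecaster assigned to the outcome that actually happened; lower is better. The 0.974 and 0.971 figures are the benchmark's; the snippet below uses made-up match probabilities purely to show the calculation.

```python
import math

def log_loss(probs_for_actual_outcome: list[float]) -> float:
    """Mean negative log-probability assigned to the results that actually occurred."""
    return -sum(math.log(p) for p in probs_for_actual_outcome) / len(probs_for_actual_outcome)

# Toy three-match example: probability each forecaster gave to the result that happened.
model_probs = [0.38, 0.41, 0.35]
market_probs = [0.39, 0.40, 0.36]
print(log_loss(model_probs), log_loss(market_probs))
# Unless the model's number is meaningfully lower than the market's, there is no edge to bet.
```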

Ross Taylor, CEO of General Reasoning and former Meta AI brain, told the Financial Times that most AI benchmarks live in “very static environments” that look about as much like the real world as a Bored Ape looks like a real ape. "There's a lot of excitement about AI automation, but there haven't been many attempts to evaluate AI in long-term, real-world environments." Or: we’re still testing Teslas in parking lots while pretending they can handle the Autobahn.

The team built a 44-point sophistication rubric with quantitative betting experts.

Publisher: gascope.com
Updated: Apr 16, 2026, 17:41 UTC

Disclaimer: This content is for information and entertainment purposes only. It does not constitute financial, investment, legal, or tax advice. Always do your own research and consult with qualified professionals before making any financial decisions.
