General Reasoning

Apr 30, 2026

arXiv:2604.27865

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Thomas J. Grady, Thomas Grady, Kip Parker, Iliyan Zarov, Henry Course, Chengxi Taylor, Ross Taylor

AI Summary

KellyBench is a new environment for evaluating sequential decision-making in long-horizon, non-stationary settings, specifically sports betting markets. Agents must maximize bankroll growth in a simulated English Premier League season, using historical data and adapting to changing market conditions. Results show that state-of-the-art language models consistently lose money, with the best model achieving an average return of only -8%, and that their strategies are unsophisticated compared to those of human experts, as measured by a novel rubric.

Key Contribution

Even the most advanced language models still lose money and demonstrate unsophisticated strategies when tasked with maximizing long-term bankroll growth in a realistic sports betting simulation, highlighting a significant gap in their sequential decision-making capabilities.

Abstract

Language models are saturating benchmarks for procedural tasks with narrow objectives, but they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season across five seeds. The best-performing model achieves an average return of -8%, and many models experience ruin across seeds. To judge strategy sophistication, we use a human expert rubric to grade each model and find their approaches to be unsophisticated compared to human baselines; Claude Opus 4.6 achieves a rubric score of 26.5%, indicating significant room for improvement. KellyBench is available as an open-access API endpoint at https://openreward.ai/GeneralReasoning/KellyBench.
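The abstract does not spell out how a stake should be sized once an edge is found. As an illustration only, the sketch below uses the classical Kelly criterion, which the benchmark's name appears to allude to, for a single binary bet at decimal odds. The function names `kelly_fraction` and `expected_log_growth` are hypothetical and are not part of the KellyBench API; the win probability is assumed to come from the agent's own model.

```python
import math


def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake on a binary bet under the Kelly criterion.

    p_win        -- the bettor's own estimate of the win probability (assumed input)
    decimal_odds -- bookmaker decimal odds; a winning unit stake returns this amount
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must lie in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1")
    net_odds = decimal_odds - 1.0          # profit per unit staked on a win
    edge = p_win * decimal_odds - 1.0      # expected profit per unit staked
    # With no positive edge the growth-optimal stake is zero.
    return max(0.0, edge / net_odds)


def expected_log_growth(p_win: float, decimal_odds: float, fraction: float) -> float:
    """Expected log growth of the bankroll per bet when staking `fraction` of it."""
    net_odds = decimal_odds - 1.0
    return (p_win * math.log1p(fraction * net_odds)
            + (1.0 - p_win) * math.log1p(-fraction))


if __name__ == "__main__":
    f = kelly_fraction(0.5, 2.2)
    print(f"Kelly stake: {f:.4f} of bankroll")
    print(f"Growth at Kelly stake: {expected_log_growth(0.5, 2.2, f):.5f}")
    print(f"Growth at 3x Kelly:    {expected_log_growth(0.5, 2.2, 3 * f):.5f}")
```

Staking well above the Kelly fraction makes expected log growth negative even when the bet has a positive edge, which is one route to the ruin the abstract reports.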

Eval Frameworks & Benchmarks, Tool Use & Agents, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
