HKUST | NTU | NYU | Apr 22, 2026 | arXiv:2604.20421

Unlocking the Forecasting Economy: A Suite of Datasets for the Full Lifecycle of Prediction Market: [Experiments & Analysis]

Huaiyu Jia, Luofeng Zhou, Lin William Cong, Siguang Li, Shuo Sun

AI Summary

This paper introduces a dataset suite for decentralized prediction markets, built on Polymarket, that covers the full market lifecycle from creation to final settlement. To address data fragmentation across off-chain and on-chain sources, incomplete linkage, and continuous synchronization, the authors build a unified relational data system that integrates market metadata, fill-level trading records, and oracle-resolution events. The resulting dataset contains more than 770,000 market records, over 943 million fill records, and nearly 2 million oracle events. Its utility is demonstrated through descriptive analyses of market activity and two case studies: NBA outcome calibration and CPI expectation reconstruction.

Key Contribution

A continuously maintained dataset suite that links market metadata, fill-level trades, and oracle-resolution events across the full lifecycle of decentralized prediction markets on Polymarket.

Abstract

Prediction markets are markets for trading claims on future events, such as presidential elections, and their prices provide continuously updated signals of collective beliefs. In decentralized platforms such as Polymarket, the market lifecycle spans market creation, token registration, trading, oracle interaction, dispute, and final settlement, yet the corresponding data are fragmented across heterogeneous off-chain and on-chain sources. We present the first continuously maintained dataset suite for the full lifecycle of decentralized prediction markets, built on Polymarket. To address the challenges of large-scale cross-source integration, incomplete linkage, and continuous synchronization, we build a unified relational data system that integrates three canonical layers: market metadata, fill-level trading records, and oracle-resolution events, through identifier resolution, on-chain recovery, and incremental updates. The resulting dataset spans October 2020 to March 2026 and comprises more than 770 thousand market records, over 943 million fill records, and nearly 2 million oracle events. We describe the data model, collection pipeline, and consistency mechanisms that make the dataset reproducible and extensible, and we demonstrate its utility through descriptive analyses of market activity and two downstream case studies: NBA outcome calibration and CPI expectation reconstruction.
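To make the three-layer design described in the abstract more concrete, the following is a minimal sketch of how market metadata, fill-level trades, and oracle events might be linked relationally and updated incrementally. It is an illustration only: every table name, column name, and event label here (`markets`, `tokens`, `fills`, `oracle_events`, `upsert_fills`, `market_lifecycle`, `'proposed'`) is a hypothetical assumption and not the dataset's actual schema or pipeline. SQLite stands in for whatever storage engine the authors use.

```python
import sqlite3

# Hypothetical three-layer schema. Table and column names are illustrative
# assumptions, not the data model published with the dataset.
SCHEMA = """
CREATE TABLE markets (
    market_id TEXT PRIMARY KEY,
    question  TEXT NOT NULL
);
CREATE TABLE tokens (
    token_id  TEXT PRIMARY KEY,
    market_id TEXT NOT NULL REFERENCES markets(market_id),
    outcome   TEXT NOT NULL
);
CREATE TABLE fills (
    fill_id  TEXT PRIMARY KEY,
    token_id TEXT NOT NULL REFERENCES tokens(token_id),
    ts       INTEGER NOT NULL,
    price    REAL NOT NULL,
    size     REAL NOT NULL
);
CREATE TABLE oracle_events (
    event_id   TEXT PRIMARY KEY,
    market_id  TEXT NOT NULL REFERENCES markets(market_id),
    ts         INTEGER NOT NULL,
    event_type TEXT NOT NULL
);
"""


def build_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def upsert_fills(conn, rows):
    """Incremental update: a fill_id that was already ingested is skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO fills (fill_id, token_id, ts, price, size) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def market_lifecycle(conn, market_id):
    """Return (fill_count, traded_size, oracle_event_count) for one market."""
    fill_count, traded_size = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(f.size), 0) "
        "FROM fills f JOIN tokens t ON t.token_id = f.token_id "
        "WHERE t.market_id = ?",
        (market_id,),
    ).fetchone()
    (event_count,) = conn.execute(
        "SELECT COUNT(*) FROM oracle_events WHERE market_id = ?",
        (market_id,),
    ).fetchone()
    return fill_count, traded_size, event_count


# Toy data: one market with two outcome tokens and one oracle event.
conn = build_db()
conn.execute("INSERT INTO markets VALUES ('m1', 'Example question?')")
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [("t_yes", "m1", "Yes"), ("t_no", "m1", "No")],
)
conn.execute("INSERT INTO oracle_events VALUES ('e1', 'm1', 300, 'proposed')")

# The second batch overlaps the first; only the new fill f3 is added.
batch = [("f1", "t_yes", 100, 0.62, 10.0), ("f2", "t_no", 110, 0.38, 5.0)]
upsert_fills(conn, batch)
upsert_fills(conn, batch + [("f3", "t_yes", 120, 0.65, 2.5)])

print(market_lifecycle(conn, "m1"))
```

In this sketch, fills reference outcome tokens rather than markets directly, so joining through `tokens` plays the role of identifier resolution between the trading and metadata layers. Keying each layer on a stable identifier is what makes re-running an overlapping ingestion batch safe.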

Data Curation & Synthetic Data

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
