Mar 12, 2026 · arXiv:2603.11838

DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

Yu Yan, Yutong Yan, Raphael Tang, Zhenyu Gao, Wenxin Jiang, Wenxi Jiang, Yao Lu

AI Summary

DatedGPT is a family of 1.3B-parameter language models trained from scratch on temporally partitioned data, with annual cutoffs from 2013 to 2024, to prevent lookahead bias in financial forecasting. Each model was trained on ~100B tokens and instruction fine-tuned on general and financial datasets that respect the same cutoff. Perplexity probing confirms that each model's knowledge is bounded by its cutoff year, and benchmark results are competitive with existing models of similar scale, demonstrating the feasibility of time-aware pretraining.

Key Contribution

Training LLMs on temporally partitioned data offers a practical way to mitigate lookahead bias, enabling more reliable financial forecasting.

Abstract

In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.
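The two mechanisms named in the abstract, annual data cutoffs and perplexity-based probing, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The helper names `partition_by_cutoff` and `perplexity`, the document schema with a `date` field, and the reading of a cutoff as "everything dated on or before December 31 of that year" are all assumptions made for illustration.

```python
import math
from datetime import date


def partition_by_cutoff(docs, cutoff_years):
    """Map each cutoff year to the documents dated on or before Dec 31 of that year.

    Assumption: cutoffs are cumulative, so the 2015 split also contains
    the 2013 and 2014 documents. The source does not spell this out.
    """
    return {
        year: [d for d in docs if d["date"] <= date(year, 12, 31)]
        for year in cutoff_years
    }


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy corpus (hypothetical data): one document per year.
docs = [
    {"text": "a", "date": date(2013, 5, 1)},
    {"text": "b", "date": date(2014, 1, 1)},
    {"text": "c", "date": date(2015, 12, 31)},
]
splits = partition_by_cutoff(docs, range(2013, 2016))
```

In a probe of this kind, one would score text from each year under each model and expect perplexity to rise for text dated after that model's cutoff. The per-token log-probabilities would come from the model under test, which this sketch leaves out.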

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Open-Source Models & Weights

Citation Metrics

Citations: 1

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
