The paper critiques the current evaluation of time-series forecasting models, arguing that benchmarks dominated by strong periodicities and seasonalities lead to illusory gains for complex deep learning models. It demonstrates that simpler classical methods often perform comparably well on these datasets, questioning the justification for the increased complexity and computational cost of deep learning. The authors advocate for the adoption of more diverse benchmarks with a wider spectrum of non-stationarities and the inclusion of robust classical baselines in evaluations to ensure reported gains reflect genuine advances.
Time-series forecasting benchmarks are giving deep learning models undeserved credit, as simpler classical methods often perform just as well on datasets with strong periodicities.
We argue that the current practice of evaluating AI/ML time-series forecasting models predominantly on benchmarks characterized by strong, persistent periodicities and seasonalities obscures real progress by overlooking the performance of efficient classical methods. We demonstrate that these "standard" datasets often exhibit dominant autocorrelation patterns and seasonal cycles that simpler linear or statistical models capture effectively. On such data, complex deep learning architectures frequently perform no better than their classical counterparts, which raises the question of whether any marginal improvements justify the significant increase in computational overhead and model complexity. We call on the community to (I) retire or substantially augment current benchmarks with datasets that exhibit a wider spectrum of non-stationarities, such as structural breaks, time-varying volatility, and concept drift, as well as less predictable dynamics drawn from diverse real-world domains, and (II) require every deep learning submission to include robust classical and simple baselines, chosen to suit the characteristics of the time series in the downstream task. Doing so will help ensure that reported gains reflect genuine methodological advances rather than artifacts of benchmark selection that favor models adept at learning repetitive patterns.
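To make the second recommendation concrete, here is a minimal sketch of one simple baseline a submission could report: a seasonal-naive forecast that repeats the last observed cycle. It is not taken from the paper. The function names `seasonal_naive` and `mae`, the synthetic sine series, the period of 24, and the unit level shift are all illustrative assumptions.

```python
import math


def seasonal_naive(history, period, horizon):
    """Forecast by repeating the last full seasonal cycle of `history`."""
    if len(history) < period:
        raise ValueError("history must cover at least one full period")
    last_cycle = history[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical, perfectly periodic series (assumed period of 24 steps).
period = 24
series = [math.sin(2 * math.pi * t / period) for t in range(240)]
train, test = series[:-48], series[-48:]

forecast = seasonal_naive(train, period, horizon=len(test))
baseline_error = mae(test, forecast)

# Same test window with an assumed level shift of +1.0, a crude stand-in
# for a structural break that the baseline has no way to anticipate.
shifted_test = [x + 1.0 for x in test]
break_error = mae(shifted_test, forecast)

print(f"periodic MAE: {baseline_error:.3g}, after level shift: {break_error:.3g}")
```

On the purely periodic series the baseline error is essentially zero, while the level shift raises it to about one. This toy contrast only illustrates why strongly seasonal benchmarks can be solved by trivial methods; it says nothing about any specific dataset or model.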