Search papers, labs, and topics across Lattice.
The paper identifies and formalizes a "Stationarity Bias" in time-series imputation benchmarks, where uniform random masking overemphasizes stationary regimes, leading to misleadingly good performance for simple methods. To address this, they propose a Stratified Stress-Test that evaluates imputation performance separately for Stationary and Transient regimes. Using Continuous Glucose Monitoring (CGM) data, they demonstrate that while linear interpolation excels in stationary regimes, deep learning models are crucial for preserving morphological fidelity during critical transients, and they introduce a method to train models with realistic missingness patterns derived from clinical trials.
Linear interpolation is surprisingly good at imputing time-series data during stable periods, but falls apart during critical transient events, revealing a hidden "RMSE Mirage" that masks signal distortion.
Time-series imputation benchmarks employ uniform random masking and shape-agnostic metrics (MSE, RMSE), implicitly weighting evaluation by regime prevalence. In systems with a dominant attractor -- homeostatic physiology, nominal industrial operation, stable network traffic -- this creates a systematic \emph{Stationarity Bias}: simple methods appear superior because the benchmark predominantly samples the easy, low-entropy regime where they trivially succeed. We formalize this bias and propose a \emph{Stratified Stress-Test} that partitions evaluation into Stationary and Transient regimes. Using Continuous Glucose Monitoring (CGM) as a testbed -- chosen for its rigorous ground-truth forcing functions (meals, insulin) that enable precise regime identification -- we establish three findings with broad implications:(i)~Stationary Efficiency: Linear interpolation achieves state-of-the-art reconstruction during stable intervals, confirming that complex architectures are computationally wasteful in low-entropy regimes.(ii)~Transient Fidelity: During critical transients (post-prandial peaks, hypoglycemic events), linear methods exhibit drastically degraded morphological fidelity (DTW), disproportionate to their RMSE -- a phenomenon we term the \emph{RMSE Mirage}, where low pointwise error masks the destruction of signal shape.(iii)~Regime-Conditional Model Selection: Deep learning models preserve both pointwise accuracy and morphological integrity during transients, making them essential for safety-critical downstream tasks. We further derive empirical missingness distributions from clinical trials and impose them on complete training data, preventing models from exploiting unrealistically clean observations and encouraging robustness under real-world missingness. This framework generalizes to any regulated system where routine stationarity dominates critical transients.