TRACE is a conditional estimation paradigm for multimodal time series foundation models (TS-FMs) that targets two problems: temporal misalignment and partial modality missingness. It infers incomplete target modalities from the auxiliary modalities that are available, which makes the resulting cross-modal representations more robust and reliable. On benchmarks spanning healthcare (MIMIC-IV) and multimodal sentiment analysis (CMU-MOSI, CMU-MOSEI), TRACE outperforms prior multimodal fusion methods, particularly when modality missingness is severe.
TRACE infers missing target modalities from available auxiliary modalities and outperforms prior multimodal fusion approaches under severe modality missingness.
Time series foundation models (TS-FMs) aim to learn generalizable temporal representations that can be adapted to a wide range of downstream tasks. In real-world multimodal settings, time series are frequently affected by temporal misalignment and partial modality missingness, where different modalities are observed at heterogeneous time scales or are partially absent. Existing approaches typically rely on naive imputation or masking strategies, which fail to account for cross-modal dependencies and often lead to misaligned or degraded representations. We propose TRACE, a conditional estimation paradigm for multimodal time series foundation model pipelines under missingness and irregular sampling, allowing incomplete target modalities to be systematically inferred from available auxiliary modalities. We evaluate TRACE on diverse multimodal benchmarks spanning healthcare and affective computing, including the MIMIC-IV clinical dataset and the CMU-MOSI and CMU-MOSEI benchmarks for multimodal sentiment analysis. Across a range of downstream prediction tasks and missing-modality settings, TRACE consistently outperforms prior multimodal fusion approaches, demonstrating improved robustness to severe modality missingness and more reliable cross-modal representations.
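To make the contrast between naive imputation and conditional estimation concrete, here is a minimal sketch on synthetic data. It is not the TRACE method: the abstract does not specify the estimator, so a ridge-regularized linear map stands in for it. The function names (`fit_conditional_estimator`, `impute_target`), the feature dimensions, and the 50% missingness rate are all assumptions chosen for illustration.

```python
import numpy as np

def fit_conditional_estimator(aux, target, ridge=1e-3):
    """Fit a ridge-regularized linear map from auxiliary to target features.

    Hypothetical stand-in for a learned conditional estimator.
    aux: (n, d_aux), target: (n, d_target), rows where both are observed.
    """
    aux_mean = aux.mean(axis=0)
    target_mean = target.mean(axis=0)
    a = aux - aux_mean
    t = target - target_mean
    # Closed-form ridge solution: (A^T A + ridge * I)^{-1} A^T T
    weights = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ t)
    return weights, aux_mean, target_mean

def impute_target(aux, target, observed):
    """Fill missing target rows using the auxiliary modality.

    observed: boolean mask of shape (n,), True where the target is present.
    """
    weights, aux_mean, target_mean = fit_conditional_estimator(
        aux[observed], target[observed]
    )
    filled = target.copy()
    filled[~observed] = (aux[~observed] - aux_mean) @ weights + target_mean
    return filled

# Synthetic data (assumption): the target modality depends on the auxiliary one.
rng = np.random.default_rng(0)
n = 500
aux = rng.normal(size=(n, 4))
true_map = rng.normal(size=(4, 3))
target_full = aux @ true_map + 0.1 * rng.normal(size=(n, 3))

# Drop the target modality for roughly half of the samples.
observed = rng.random(n) > 0.5
target = np.where(observed[:, None], target_full, np.nan)

# Conditional estimate versus naive mean imputation.
conditional = impute_target(aux, target, observed)
naive = target.copy()
naive[~observed] = target[observed].mean(axis=0)

def rmse(estimate):
    """Root-mean-square error on the rows where the target was missing."""
    diff = estimate[~observed] - target_full[~observed]
    return float(np.sqrt(np.mean(diff ** 2)))

cond_err = rmse(conditional)
naive_err = rmse(naive)
print(f"conditional RMSE: {cond_err:.3f}, mean-imputation RMSE: {naive_err:.3f}")
```

Because the synthetic target is constructed to depend on the auxiliary modality, the conditional estimate recovers the missing rows far better than mean imputation, which ignores cross-modal dependencies. Real multimodal time series would additionally require handling temporal misalignment, which this sketch leaves out.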