The paper introduces VETime, a novel zero-shot time series anomaly detection (TSAD) framework that unifies temporal and visual modalities to address the limitations of existing 1D and 2D foundation models. VETime uses Reversible Image Conversion and Patch-Level Temporal Alignment to create a shared visual-temporal timeline, preserving fine-grained details and temporal sensitivity. By incorporating Anomaly Window Contrastive Learning and Task-Adaptive Multi-Modal Fusion, VETime achieves state-of-the-art zero-shot TSAD performance with improved localization precision and reduced computational cost.
By unifying temporal and visual modalities with fine-grained alignment, VETime achieves state-of-the-art zero-shot time series anomaly detection.
Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks caused by missing temporal alignment and can localize individual points only coarsely. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code is available at https://github.com/yyyangcoder/VETime.
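To make the idea of a shared visual-temporal timeline more concrete, the following is a minimal sketch of one way a 1D series could be turned into a 2D array and back without loss, with each image patch mapped to the time steps it covers. It is an illustration only and is not VETime's actual Reversible Image Conversion or Patch-Level Temporal Alignment; the row-wise folding scheme and the names `series_to_image`, `image_to_series`, and `patch_time_indices` are assumptions introduced here.

```python
import numpy as np


def series_to_image(x, width):
    """Fold a 1D series into a 2D array row by row, zero-padding the tail.

    Assumption: a plain row-major fold, chosen only because it is trivially invertible.
    """
    n = len(x)
    height = -(-n // width)  # ceiling division
    padded = np.zeros(height * width, dtype=float)
    padded[:n] = x
    return padded.reshape(height, width), n


def image_to_series(img, n):
    """Invert series_to_image by flattening and dropping the padding."""
    return img.reshape(-1)[:n]


def patch_time_indices(height, width, patch, n):
    """Map each patch (visited in row-major order) to the time indices it covers."""
    idx = np.arange(height * width).reshape(height, width)
    mapping = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            t = idx[r:r + patch, c:c + patch].reshape(-1)
            mapping.append(t[t < n])  # discard indices that fall in the padding
    return mapping


# Hypothetical usage: a 100-step sine wave folded into rows of 16 samples.
x = np.sin(np.linspace(0, 20, 100))
img, n = series_to_image(x, width=16)
restored = image_to_series(img, n)
patches = patch_time_indices(img.shape[0], img.shape[1], patch=4, n=n)
```

Because every patch carries the list of time steps it covers, a patch-level score from a vision model could in principle be projected back onto the original timeline, which is the kind of alignment the abstract describes at a high level.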