Chengdu Medical College; Department of Ultrasound; Peking Union Medical College; School of Clinical Medicine; SCU; The First Affiliated Hospital of Chengdu

Jun 16, 2026

arXiv:2606.17437

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

AI Summary

This paper addresses automated classification of echocardiographic views by releasing the Echocardiographic Videos of Nine Views (EV9V) dataset, which comprises 5,138 videos, 910,579 frames, and 9 standard views and is, to the authors' knowledge, the largest publicly available echocardiography video dataset. The authors benchmark CNN, RNN, and Transformer video classification architectures and propose a Spatio-Temporal Fusion Model (STFM), a dual-stream CNN-LSTM framework that captures both spatial and temporal features while remaining robust to variations in frame quality. Experiments show that STFM achieves competitive performance, supporting the use of uncertainty-aware spatio-temporal learning for echocardiographic view classification.

Key Contribution

The paper contributes the EV9V dataset, a systematic benchmark of representative video classification architectures on it, and STFM, a dual-stream CNN-LSTM model that uses uncertainty-aware segment sampling during training and evidence-based fusion during inference.

Abstract

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.
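The abstract does not spell out how segments are sampled or how evidence is fused, so the following is only a minimal sketch under an assumed formulation: each video segment yields a non-negative evidence vector over the view classes, and uncertainty follows the Dirichlet (subjective logic) convention common in evidential deep learning, u = K / S with S = sum(evidence + 1). The function names, the summation-based fusion rule, and the sampling weights are all hypothetical illustrations, not the authors' implementation.

```python
import random
from typing import List, Sequence


def dirichlet_uncertainty(evidence: Sequence[float]) -> float:
    """Assumed uncertainty measure: u = K / S, where S = sum(e_k + 1)."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def fuse_segments(segment_evidence: List[Sequence[float]]) -> List[float]:
    """Fuse segments by summing evidence, then return expected class probabilities.

    Segments with little evidence (e.g. low-quality frames) contribute
    little to the total, so they barely move the fused prediction.
    """
    num_classes = len(segment_evidence[0])
    total = [0.0] * num_classes
    for evidence in segment_evidence:
        for c, e in enumerate(evidence):
            total[c] += e
    alpha = [t + 1.0 for t in total]  # Dirichlet parameters
    strength = sum(alpha)
    return [a / strength for a in alpha]


def sample_segment(segment_evidence: List[Sequence[float]],
                   rng: random.Random) -> int:
    """Pick a segment index with probability proportional to 1 - uncertainty."""
    # The small constant keeps the weights valid when every segment has zero evidence.
    weights = [1.0 - dirichlet_uncertainty(e) + 1e-6 for e in segment_evidence]
    return rng.choices(range(len(segment_evidence)), weights=weights, k=1)[0]


# Toy example with 3 classes (EV9V has 9): one clear segment, one
# low-quality segment with almost no evidence, one ambiguous segment.
segments = [
    [9.0, 0.0, 0.0],
    [0.2, 0.3, 0.1],
    [1.0, 2.0, 0.0],
]
fused = fuse_segments(segments)
predicted_view = max(range(len(fused)), key=fused.__getitem__)
picked = sample_segment(segments, random.Random(0))
```

In this toy setup the low-quality segment has the highest uncertainty, so it is the least likely to be sampled and has the least influence on the fused prediction, which is the behaviour the abstract attributes to uncertainty-aware learning.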

Computer Vision; Data Curation & Synthetic Data; Multimodal Models

Year: 2026

Venue: N/A
