National Cheng Kung University | NYCU | May 21, 2026 | arXiv:2605.22096

VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

Bo-Cheng Qiu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu

AI Summary

The paper introduces VISTA, a multi-backbone framework for detecting rare pathologies in video capsule endoscopy (VCE) that pairs a temporal foundation model with a spatial one. VISTA uses EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). Post-competition threshold refinement, which extends the local search with a global coarse search, raised hidden-test temporal mAP@0.5 from 0.3530 to 0.3726 and mAP@0.95 from 0.3235 to 0.3431.
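To make the two reported operating points concrete, the sketch below computes a temporal intersection-over-union (IoU) between a predicted and a ground-truth event and checks it against the 0.5 and 0.95 thresholds. This is the standard interval IoU, not the challenge's official scorer; the `Event` type, the function names, and the treatment of interval endpoints are assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical event representation: (start, end) with end >= start.
Event = Tuple[float, float]


def temporal_iou(pred: Event, gt: Event) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Event, gt: Event, iou_threshold: float) -> bool:
    """A prediction counts as a hit only if its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= iou_threshold


# A prediction that starts two frames early on a 10-frame event.
pred, gt = (10.0, 20.0), (12.0, 20.0)
print(temporal_iou(pred, gt))    # overlap 8, union 10
print(is_match(pred, gt, 0.5))   # loose threshold
print(is_match(pred, gt, 0.95))  # strict threshold
```

Under this definition the same prediction can count as correct at 0.5 and as a miss at 0.95, which is why boundary placement matters far more for the stricter metric.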

Key Contribution

Fusing spatial and temporal foundation models with anatomy-aware event decoding and validation-guided threshold refinement improves rare-pathology detection in capsule endoscopy; the refined system ranked second in the post-competition evaluation.

Abstract

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
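The post-competition change, a global coarse search added to local threshold refinement, can be illustrated with a generic coarse-to-fine grid search. The sketch below is not the authors' procedure: `score_fn` is a hypothetical stand-in for a validation metric (for example, temporal mAP at a given decision threshold), and the step sizes and the [0, 1] search range are assumptions.

```python
from typing import Callable, Tuple


def coarse_to_fine_threshold(
    score_fn: Callable[[float], float],
    coarse_step: float = 0.1,
    fine_step: float = 0.01,
) -> Tuple[float, float]:
    """Return (best_threshold, best_score) for a threshold in [0, 1]."""
    # Global coarse pass: evenly spaced thresholds over the whole range,
    # so the search is not trapped near a poor starting point.
    n_coarse = round(1.0 / coarse_step)
    coarse = [i * coarse_step for i in range(n_coarse + 1)]
    best_t = max(coarse, key=score_fn)

    # Local pass: finer grid within one coarse step of the coarse winner.
    lo = max(0.0, best_t - coarse_step)
    hi = min(1.0, best_t + coarse_step)
    n_fine = round((hi - lo) / fine_step)
    fine = [lo + i * fine_step for i in range(n_fine + 1)]
    best_t = max(fine + [best_t], key=score_fn)
    return best_t, score_fn(best_t)


# Toy metric (assumed, for illustration only) that peaks at 0.37.
threshold, score = coarse_to_fine_threshold(lambda t: -(t - 0.37) ** 2)
print(round(threshold, 2))
```

A purely local search started far from the optimum can settle on a nearby plateau; the coarse pass trades a handful of extra metric evaluations for a better starting region.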

Computer Vision | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
