The paper introduces VISTA, a multi-backbone framework for detecting rare pathologies in video capsule endoscopy (VCE) data using both temporal and spatial foundation models. VISTA uses EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, combined with a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). After the competition, adding a global coarse search to the local threshold refinement raised hidden-test temporal mAP@0.5 from 0.3530 to 0.3726 and mAP@0.95 from 0.3235 to 0.3431.
Fusing spatial and temporal foundation models with anatomy-aware event decoding reaches a hidden-test temporal mAP@0.5 of 0.3726 for rare pathology detection in capsule endoscopy, ranking second in the post-competition evaluation.
Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
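The abstract does not specify how the threshold search is implemented, so the following is only a minimal sketch of one way a global coarse search can be combined with local refinement. The function name `coarse_then_local_search`, the `score_fn` callback (standing in for a validation metric such as temporal mAP), the grid values, and the refinement radius are all illustrative assumptions, not the authors' code or settings.

```python
from typing import Callable, Sequence, Tuple


def coarse_then_local_search(
    score_fn: Callable[[float], float],
    coarse_grid: Sequence[float],
    local_radius: float = 0.05,
    local_steps: int = 11,
) -> Tuple[float, float]:
    """Pick a decision threshold in two stages (hypothetical helper).

    score_fn maps a threshold in [0, 1] to a validation score to maximize.
    Returns (best_threshold, best_score).
    """
    if local_steps < 2:
        raise ValueError("local_steps must be at least 2")

    # Stage 1: global coarse search over a sparse grid of thresholds.
    coarse_best = max(coarse_grid, key=score_fn)

    # Stage 2: local refinement on a finer grid centered on the coarse winner,
    # clipped to the valid threshold range [0, 1].
    lo = max(0.0, coarse_best - local_radius)
    hi = min(1.0, coarse_best + local_radius)
    step = (hi - lo) / (local_steps - 1)
    candidates = [lo + i * step for i in range(local_steps)]

    # Keep the coarse winner as a candidate so refinement never scores worse.
    candidates.append(coarse_best)

    best = max(candidates, key=score_fn)
    return best, score_fn(best)
```

In this sketch the coarse stage keeps the search from settling in a poor region, and the local stage then tunes the threshold within a small window around the coarse winner. A real pipeline would likely run such a search per class, or per decoding parameter, against a held-out validation split.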