This paper introduces a modified BiomedCLIP architecture for multi-label video capsule endoscopy (VCE) classification that targets the class imbalance in the Galar dataset. The core modification replaces standard multi-head self-attention with a differential attention mechanism to suppress attention noise. To handle the skewed label distribution, the method also uses a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization. It achieves a temporal mAP@0.5 of 0.2456 on the RARE-VISION test set.
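To make the differential attention idea concrete, the following is a minimal single-head sketch in plain Python. It subtracts one softmax attention map from another before weighting the values. The function names, the list-of-lists representation, and the fixed scalar `lam` (standing in for what is typically a learned weight) are assumptions made for illustration. This is not the paper's implementation, and it says nothing about how the mechanism is wired into BiomedCLIP.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Scaled dot-product attention weights; q and k are lists of vectors."""
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]


def differential_map(q1, k1, q2, k2, lam):
    """Difference of two softmax maps: A1 - lam * A2 (lam is an assumed fixed scalar)."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    return [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]


def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Apply the differential map to the value vectors."""
    diff = differential_map(q1, k1, q2, k2, lam)
    width = len(v[0])
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(width)]
        for row in diff
    ]
```

With `lam=0` the function reduces to ordinary softmax attention on the first query/key pair. With `lam=1` and identical query/key pairs the two maps cancel exactly, which illustrates how components shared by both maps are subtracted out.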
Differential attention and asymmetric loss functions can significantly improve the performance of BiomedCLIP on highly imbalanced video classification tasks like identifying rare pathologies in video capsule endoscopy.
This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
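The temporal post-processing step can be sketched as follows for a single class: smooth the per-frame binary predictions with a median filter, collapse runs of positives into events, merge events separated by short gaps, and serialize the result. The window size, the `max_gap` value, the tie-breaking rule, and the JSON field names (`label`, `start_frame`, `end_frame`) are all hypothetical; the abstract does not state the actual parameters or the schema.

```python
import json
from statistics import median


def median_smooth(flags, window=5):
    """Sliding-window median over a 0/1 sequence.

    Edges use a truncated window; a tie (median of 0.5) counts as negative.
    """
    half = window // 2
    out = []
    for i in range(len(flags)):
        lo, hi = max(0, i - half), min(len(flags), i + half + 1)
        out.append(1 if median(flags[lo:hi]) > 0.5 else 0)
    return out


def to_events(flags):
    """Collapse runs of 1s into (start, end) frame indices, end inclusive."""
    events, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(flags) - 1))
    return events


def merge_gaps(events, max_gap=2):
    """Merge consecutive events separated by at most max_gap negative frames."""
    merged = []
    for s, e in events:
        if merged and s - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def events_to_json(label, events):
    """Serialize events using a hypothetical schema (field names are assumed)."""
    return json.dumps(
        [{"label": label, "start_frame": s, "end_frame": e} for s, e in events]
    )
```

For example, `merge_gaps(to_events(median_smooth(flags, window=3)))` turns a noisy per-frame prediction sequence into a short list of contiguous events. The median filter removes isolated single-frame positives, and gap merging joins detections that were split by a brief dropout.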