Mar 18, 2026 · arXiv:2603.17879

Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification

Podakanti Satyajith Chary, Nagarajan Ganapathy

AI Summary

This paper introduces a modified BiomedCLIP architecture for multi-label video capsule endoscopy (VCE) classification, specifically addressing class imbalance in the Galar dataset. The core modification involves replacing standard attention with a differential attention mechanism to reduce noise. The method also employs a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization to handle the skewed label distribution, achieving a temporal mAP@0.5 of 0.2456 on the RARE-VISION test set.
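To make the differential attention idea concrete, here is a minimal single-head NumPy sketch that subtracts one softmax attention map from another before applying it to the values. The function name, the projection matrices, and the fixed scalar `lam` are illustrative assumptions; the paper's actual parameterization inside BiomedCLIP is not given in this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Single-head differential attention (illustrative sketch).

    Two independent query/key projections give two softmax attention maps.
    Their weighted difference replaces the usual single map, so attention
    mass that both maps assign to the same tokens tends to cancel.
    `lam` is a fixed scalar here (an assumption); it could be learned.
    """
    d = wq1.shape[1]
    a1 = softmax((x @ wq1) @ (x @ wk1).T / np.sqrt(d))
    a2 = softmax((x @ wq2) @ (x @ wk2).T / np.sqrt(d))
    diff = a1 - lam * a2
    return diff @ (x @ wv), diff

# Toy input: 5 tokens with 8-dimensional embeddings, head size 4.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4
x = rng.normal(size=(n_tokens, d_model))
wq1, wk1, wq2, wk2 = (rng.normal(size=(d_model, d_head)) for _ in range(4))
wv = rng.normal(size=(d_model, d_head))
out, diff = differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5)
```

Because each softmax row sums to 1, every row of the difference map sums to `1 - lam`, and individual entries can be negative, unlike in standard attention.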

Key Contribution

Combines differential attention with asymmetric focal loss and imbalance-aware sampling to adapt BiomedCLIP to highly imbalanced multi-label classification, such as detecting rare pathologies in video capsule endoscopy.
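As a rough illustration of how an asymmetric focal loss and square-root frequency weighting might look, the sketch below uses one common formulation: separate focusing exponents for positive and negative labels, and class weights proportional to 1/sqrt(count). The function names, the default exponents, and the normalization are assumptions made for illustration, not the paper's reported settings.

```python
import math

def sqrt_frequency_weights(class_counts):
    # Weight each class by 1 / sqrt(count), then normalize to sum to 1.
    # The square root softens plain inverse-frequency weighting, so rare
    # classes are boosted without completely dominating sampling.
    raw = [1.0 / math.sqrt(max(c, 1)) for c in class_counts]
    total = sum(raw)
    return [w / total for w in raw]

def asymmetric_focal_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, eps=1e-8):
    # probs: predicted probabilities per label; targets: 0/1 labels.
    # Negatives use a larger focusing exponent, so easy negatives
    # (small p) contribute very little to the loss.
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total += -((1.0 - p) ** gamma_pos) * math.log(p + eps)
        else:
            total += -(p ** gamma_neg) * math.log(1.0 - p + eps)
    return total / len(probs)

# Hypothetical class counts spanning four orders of magnitude.
weights = sqrt_frequency_weights([10000, 100, 1])

# An easy negative (p = 0.1) under asymmetric focusing vs. plain BCE.
easy_negative = asymmetric_focal_loss([0.1], [0], gamma_neg=4.0)
plain_bce = asymmetric_focal_loss([0.1], [0], gamma_neg=0.0)
```

With these toy counts, the rarest class is weighted 100 times more than the most common one, rather than the 10,000 times that plain inverse-frequency weighting would give.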

Abstract

This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
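The temporal post-processing described above (median-filter smoothing, gap merging, event-level JSON) can be sketched for a single class as follows. The window size, the `max_gap` value, and the JSON field names are hypothetical choices for illustration; the abstract does not specify them.

```python
import json
from statistics import median

def median_smooth(labels, window=3):
    # Sliding median over binary per-frame predictions. Windows are
    # truncated at the sequence edges; a tie (median 0.5) rounds up to 1.
    half = window // 2
    out = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        out.append(int(median(labels[lo:hi]) >= 0.5))
    return out

def to_events(labels, max_gap=2):
    # Convert a binary frame sequence into [start, end] frame intervals
    # (inclusive), then merge intervals separated by at most max_gap
    # negative frames.
    events, start = [], None
    for i, v in enumerate(labels + [0]):  # sentinel closes a trailing run
        if v and start is None:
            start = i
        elif not v and start is not None:
            events.append([start, i - 1])
            start = None
    merged = []
    for ev in events:
        if merged and ev[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = ev[1]
        else:
            merged.append(ev)
    return merged

# Thresholded per-frame predictions for one class (toy data).
frame_preds = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
smoothed = median_smooth(frame_preds, window=3)
events = to_events(smoothed, max_gap=2)

# Hypothetical event schema; the real submission format may differ.
payload = json.dumps(
    [{"label": "finding", "start": s, "end": e} for s, e in events]
)
```

In this toy sequence, smoothing fills the one-frame dropout at frame 4 and removes the isolated positive at frame 12, leaving two events. Gap merging then only joins events whose separation is within `max_gap` frames.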

Computer Vision, Multimodal Models, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
