Humanitas University · UCL · Mar 10, 2026 · arXiv:2603.09696

TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering

Luca Carlini, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

AI Summary

TemporalDoRA is a video-specific PEFT method for surgical VideoQA. It extends Weight-Decomposed Low-Rank Adaptation (DoRA) by inserting temporal Multi-Head Attention (MHA) inside the low-rank bottleneck of the vision encoder and by applying weight decomposition only to the trainable low-rank branch. This enables temporally aware updates while the backbone stays frozen, and it improves robustness to linguistic variation in questions. Evaluated on a new colonoscopy VideoQA dataset (REAL-Colon-VQA) and on EndoVis18-VQA, TemporalDoRA improves performance, particularly on Out-of-Template questions, which highlights the importance of temporal mixing within the low-rank branch.

Key Contribution

Achieves robust surgical video question answering by injecting temporal awareness into parameter-efficient fine-tuning, outperforming standard PEFT methods on Out-of-Template questions.

Abstract

Surgical Video Question Answering (VideoQA) requires accurate temporal grounding while remaining robust to natural variation in how clinicians phrase questions, a setting in which linguistic bias can arise. Standard Parameter-Efficient Fine-Tuning (PEFT) methods adapt pretrained projections without explicitly modeling frame-to-frame interactions within the adaptation pathway, limiting their ability to exploit sparse temporal evidence. We introduce TemporalDoRA, a video-specific PEFT formulation that extends Weight-Decomposed Low-Rank Adaptation by (i) inserting lightweight temporal Multi-Head Attention (MHA) inside the low-rank bottleneck of the vision encoder and (ii) selectively applying weight decomposition only to the trainable low-rank branch rather than the full adapted weight. This design enables temporally aware updates while preserving a frozen backbone and stable scaling. By mixing information across frames within the adaptation subspace, TemporalDoRA steers updates toward temporally consistent visual cues and improves robustness with minimal parameter overhead. To benchmark this setting, we present REAL-Colon-VQA, a colonoscopy VideoQA dataset with 6,424 clip-question pairs, including paired rephrased Out-of-Template questions to evaluate sensitivity to linguistic variation. TemporalDoRA improves Out-of-Template performance, and ablation studies confirm that temporal mixing inside the low-rank branch is the primary driver of these gains. We also validate on EndoVis18-VQA adapted to short clips and observe consistent improvements on the Out-of-Template split. Code and dataset are available at https://anonymous.4open.science/r/TemporalDoRA-BFC8/ (Anonymous GitHub).
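The two design choices named in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: a single attention head stands in for MHA, the attention output is added residually to the bottleneck features, the decomposition uses a per-output-row norm of the low-rank product with a learnable magnitude vector `m`, and a small `eps` guards the zero-initialized case. All function and variable names are hypothetical. The input `x` holds the features of one token position across `T` frames.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(z, wq, wk, wv):
    """Single-head self-attention over the frame axis of z (T x r), plus a residual.
    Assumption: the paper uses multi-head attention; one head keeps the sketch short."""
    q, k, v = matmul(z, wq), matmul(z, wk), matmul(z, wv)
    scale = 1.0 / math.sqrt(len(z[0]))
    scores = [[scale * s for s in row] for row in matmul(q, transpose(k))]
    attn = [softmax(row) for row in scores]      # each frame attends to all frames
    mixed = matmul(attn, v)
    return [[zi + mi for zi, mi in zip(zr, mr)] for zr, mr in zip(z, mixed)]

def temporal_dora_forward(x, w0, a, b, m, wq, wk, wv, eps=1e-8):
    """x: T x d_in, w0: d_out x d_in (frozen), a: r x d_in, b: d_out x r, m: d_out."""
    frozen = matmul(x, transpose(w0))            # frozen backbone path, left untouched
    z = matmul(x, transpose(a))                  # down-projection into the rank-r bottleneck
    z = temporal_attention(z, wq, wk, wv)        # mix information across frames
    delta = matmul(z, transpose(b))              # up-projection back to d_out
    # Assumed decomposition: magnitude m times direction, computed on the
    # low-rank product only, never on the frozen weight w0.
    ba = matmul(b, a)
    norms = [math.sqrt(sum(v * v for v in row)) + eps for row in ba]
    return [[f + mj * d / nj for f, d, mj, nj in zip(fr, dr, m, norms)]
            for fr, dr in zip(frozen, delta)]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d_in, d_out, r = 4, 6, 5, 2
x = rand_matrix(T, d_in, rng)
w0 = rand_matrix(d_out, d_in, rng)
a = rand_matrix(r, d_in, rng)
wq, wk, wv = (rand_matrix(r, r, rng) for _ in range(3))
m = [1.0] * d_out

# With a zero-initialized up-projection the adapter contributes nothing.
b_zero = [[0.0] * r for _ in range(d_out)]
y_init = temporal_dora_forward(x, w0, a, b_zero, m, wq, wk, wv)

# With a non-zero up-projection, perturbing frame 0 also changes the last frame's output.
b = rand_matrix(d_out, r, rng)
y = temporal_dora_forward(x, w0, a, b, m, wq, wk, wv)
x_perturbed = [[v + 1.0 for v in x[0]]] + x[1:]
y_perturbed = temporal_dora_forward(x_perturbed, w0, a, b, m, wq, wk, wv)
```

In this sketch the frozen path for the last frame depends only on that frame, so any change in its output after perturbing frame 0 comes from the temporal attention inside the bottleneck. A plain LoRA or DoRA branch applied per frame would show no such cross-frame dependence.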

Computer Vision · Multimodal Models · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
