KCL | Apr 20, 2026 | arXiv:2604.18134

Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?

Chengan Che, Chao Wang, Jiayuan Huang, Xinyue Chen, Luis C. Garcia-Peraza-Herrera

AI Summary

This paper introduces LIME, a large-scale multi-modal dataset generated from open-access surgical videos using narratives produced by Large Language Models (LLMs), addressing the challenge of costly expert annotations in surgical vision-language tasks. To ensure the reliability of these noisy narratives, the authors propose SurgLIME, a parameter-efficient Vision-Language Pre-training framework that employs a dual-encoder architecture and an automated confidence estimation mechanism to down-weight uncertain text during contrastive alignment. Evaluations demonstrate that SurgLIME achieves competitive zero-shot cross-modal alignment while maintaining the robust performance of the underlying visual model on established benchmarks.

Key Contribution

LIME uses LLM-generated narratives to build a scalable surgical vision-language dataset, while SurgLIME down-weights uncertain text so that noisy narratives do not degrade the underlying visual model.

Abstract

Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could catastrophically degrade pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments from noisy narratives. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. Dataset, code, and models are publicly available at https://github.com/visurg-ai/SurgLIME.
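The abstract does not give the exact form of the confidence-weighted objective, so the following is only a minimal sketch of one way such down-weighting could work: a symmetric InfoNCE-style contrastive loss in which each video-text pair's contribution is scaled by a confidence score in [0, 1]. The function name, the weighted-mean normalization, the temperature value, and the assumption that confidence scores arrive as a precomputed array are all illustrative choices, not the authors' implementation.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def confidence_weighted_contrastive_loss(video_emb, text_emb, confidence,
                                         temperature=0.07):
    """Symmetric contrastive loss with per-pair confidence weights.

    video_emb, text_emb: arrays of shape (batch, dim); row i of each
        forms a matched video-text pair.
    confidence: array of shape (batch,), assumed to lie in [0, 1].
        How it is estimated is outside the scope of this sketch.
    """
    # L2-normalize so that the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = v @ t.T / temperature
    idx = np.arange(len(v))

    # Video-to-text and text-to-video cross-entropy on the diagonal.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx]
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx]
    per_pair = 0.5 * (loss_v2t + loss_t2v)

    # Assumed weighting: a confidence-weighted mean, so low-confidence
    # narratives contribute less to the total loss.
    w = np.asarray(confidence, dtype=float)
    return float((w * per_pair).sum() / max(w.sum(), 1e-8))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 8))
    text = rng.normal(size=(4, 8))
    conf = np.array([1.0, 0.9, 0.2, 0.6])  # hypothetical confidence scores
    print(confidence_weighted_contrastive_loss(video, text, conf))
```

With uniform confidence this reduces to a standard symmetric contrastive loss; setting a pair's confidence to zero removes its positive term from the objective, although in this sketch that pair's embeddings still act as negatives for the other pairs.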

Data Curation & Synthetic Data | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
