EchoPilot, a training-free framework, addresses the challenges of ultrasound video segmentation by integrating a medical vision-language model (VLM), a vision foundation model (VFM), and a promptable video segmentor. It introduces Scale-Space Semantic Prompting to resolve initialization ambiguity by selecting an optimal contextual view and synthesizing auxiliary point prompts. A Reliability-Gated Memory update reduces propagation drift by selectively freezing the segmentor's memory bank under uncertain predictions.
With only a single point click and an anatomical category name, EchoPilot achieves state-of-the-art ultrasound video segmentation in the sparse-interactive setting without any training, outperforming both training-free baselines and finetuned specialists.
Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
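To make the idea of a reliability-gated memory update concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how uncertainty is measured or how the gate is applied, so everything here is an assumption for illustration: the mean per-pixel binary entropy as the uncertainty score, the `max_entropy` threshold, the `capacity` of the bank, and the names `mean_binary_entropy` and `GatedMemoryBank`. A real segmentor would store feature embeddings rather than frame indices.

```python
import math
from collections import deque


def mean_binary_entropy(probs):
    """Average per-pixel binary entropy (in bits) of foreground probabilities.

    Assumption: entropy stands in for the paper's unspecified reliability measure.
    """
    eps = 1e-12
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    return total / len(probs)


class GatedMemoryBank:
    """Hypothetical memory bank that is frozen when a prediction is uncertain."""

    def __init__(self, capacity=4, max_entropy=0.5):
        # Oldest entries are evicted once capacity is reached.
        self.frames = deque(maxlen=capacity)
        # Assumed threshold; the source does not state how the gate is set.
        self.max_entropy = max_entropy

    def update(self, frame_index, probs):
        """Store the frame only if its prediction is confident.

        Returns True if the frame was written, False if the bank was frozen.
        """
        if mean_binary_entropy(probs) > self.max_entropy:
            return False  # uncertain prediction: leave memory untouched
        self.frames.append(frame_index)
        return True


bank = GatedMemoryBank(capacity=2, max_entropy=0.5)
bank.update(0, [0.99, 0.01, 0.98])  # confident mask probabilities
bank.update(1, [0.50, 0.55, 0.45])  # ambiguous mask probabilities
bank.update(2, [0.97, 0.02, 0.99])  # confident again
print(list(bank.frames))
```

In this sketch the ambiguous frame never enters the bank, so later frames are conditioned only on confident predictions, which is the error-accumulation behavior the abstract says the gate is meant to prevent.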