NTU | Taizhou Hospital of Zhejiang Province | The Central Hospital of Wuhan | WHU | Apr 30, 2026 | arXiv:2604.28011

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

Jing Zhang, Wentao Jiang, Tao Huang, Zhiwei Wang, Jianxin Liu, Jian Chen, Ping Ye, Gang Wang, Zengmao Wang, Bo Du, Dacheng Tao

AI Summary

Echo-α is an agentic multimodal reasoning model for ultrasound interpretation that combines precise lesion localization with holistic clinical reasoning. The model uses an invoke-and-reason framework to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the evidence into diagnostic decisions. Trained with a nine-task supervised curriculum followed by sequential reinforcement learning, Echo-α outperforms competitive baselines on multi-center renal and breast ultrasound benchmarks for both lesion grounding and final diagnosis, including on cross-center test sets.

Key Contribution

By unifying specialized detectors with flexible multimodal reasoning, Echo-α outperforms competitive baselines on renal and breast ultrasound grounding and diagnosis, suggesting a path toward more accurate and interpretable medical AI.

Abstract

Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-α, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-α is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward trade-offs, yielding Echo-α-Grounding for lesion anchoring and Echo-α-Diagnosis for final diagnosis. On multi-center renal and breast ultrasound benchmarks, Echo-α outperforms competitive baselines on both grounding and diagnosis. In particular, on cross-center test sets, Echo-α-Grounding attains 56.73%/43.78% F1@0.5 and Echo-α-Diagnosis reaches 74.90%/49.20% overall accuracy on renal/breast ultrasound. These results suggest that agentic multimodal reasoning can turn specialized detectors into verifiable clinical evidence, offering a practical route toward ultrasound AI systems that are more accurate, interpretable, and transferable. The repository is at https://github.com/MiliLab/Echo-Alpha.
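The grounding results above are reported as F1@0.5, which is conventionally read as a detection F1 score where a predicted box counts as correct when its intersection-over-union (IoU) with a ground-truth lesion box is at least 0.5. The abstract does not state the exact matching protocol, so the following is only a minimal sketch under assumptions: boxes are `(x1, y1, x2, y2)` tuples, each ground-truth box can be matched at most once, and predictions are matched greedily in the order given. The names `iou` and `f1_at_iou` are hypothetical and are not taken from the Echo-α repository.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """F1 with greedy one-to-one matching (an assumed protocol).

    A prediction is a true positive if its best unmatched ground-truth
    box has IoU >= thr; each ground-truth box is used at most once.
    """
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no matching lesion
    fn = len(gts) - tp    # lesions that no prediction matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
    gts = [(0, 0, 10, 8), (100, 100, 110, 110)]
    print(f1_at_iou(preds, gts))
```

In this toy case one of the two predictions overlaps a lesion with IoU 0.8 and the other overlaps nothing, giving one true positive, one false positive, and one false negative, so F1@0.5 is 0.5. A confidence-sorted or optimal (Hungarian) matching would be a reasonable alternative if the benchmark specifies one.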

Computer Vision | Multimodal Models | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
