This paper introduces a lightweight multimodal reasoning framework for clinical scene understanding, addressing limitations of current VLMs in temporal reasoning and structured output generation for robotics. The framework pairs the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer to enable chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation, and it produces structured scene graphs backed by a hybrid retrieval module. Evaluations on Video-MME and a custom clinical dataset demonstrate competitive accuracy and improved robustness compared to existing VLMs, highlighting potential for applications such as robot-assisted surgery and patient monitoring.
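Both building blocks are openly available, so the video-to-scene-graph step can be approximated outside the paper's own pipeline. The sketch below is an assumption about how such a step might look, not the authors' implementation: it prompts Qwen2.5-VL-3B-Instruct through the standard Hugging Face `transformers` interface to emit a JSON scene graph for a short clip; the video path and prompt wording are illustrative placeholders.

```python
# Minimal sketch: prompt Qwen2.5-VL-3B-Instruct for a JSON scene graph.
# Assumes `transformers` and `qwen_vl_utils` are installed; the video path
# and prompt text are illustrative, not taken from the paper.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///data/clip_ward_007.mp4", "fps": 1.0},
        {"type": "text", "text": (
            "Return a JSON scene graph with 'objects' (name, location), "
            "'relations' (subject, predicate, object), and 'actions' observed "
            "in this clinical scene. Output JSON only."
        )},
    ],
}]

# Standard Qwen2.5-VL preprocessing: chat template plus packed vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt", **video_kwargs
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
scene_graph_json = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(scene_graph_json)
```

In the paper's setting, an agent layer would decide when to invoke a step like this as a tool and feed the resulting graph into downstream planning; here it is shown standalone for clarity.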
Clinical robots get a brain boost: a lightweight multimodal agent uses structured scene graphs and interpretable retrieval to match larger VLMs on accuracy, and beat them on robustness, in complex healthcare scenarios.
Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.
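The abstract does not spell out the hybrid retrieval module; a common reading is a blend of lexical and dense (embedding) scoring over a small corpus of clinical knowledge, with both sub-scores kept visible so retrieved evidence stays interpretable. The sketch below illustrates that pattern under those assumptions; the corpus, weighting, and all-MiniLM-L6-v2 encoder are illustrative choices, not the paper's.

```python
# Minimal sketch of a hybrid (lexical + dense) retriever with interpretable
# per-document scores. Assumes sentence-transformers is installed; the corpus,
# weighting, and encoder choice are illustrative, not from the paper.
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Sterile instruments must remain on the draped tray until requested.",
    "A patient lying on the floor next to the bed indicates a possible fall.",
    "IV pumps beep when the line is occluded or the bag is empty.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

def lexical_score(query: str, doc: str) -> float:
    """Fraction of query tokens that also appear in the document (simple overlap)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_search(query: str, alpha: float = 0.5, top_k: int = 2):
    """Blend dense cosine similarity and lexical overlap; expose both sub-scores."""
    q_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    dense = util.cos_sim(q_emb, corpus_emb)[0].tolist()
    scored = []
    for i, doc in enumerate(corpus):
        lex = lexical_score(query, doc)
        scored.append({
            "doc": doc,
            "dense": round(dense[i], 3),
            "lexical": round(lex, 3),
            "score": round(alpha * dense[i] + (1 - alpha) * lex, 3),
        })
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]

for hit in hybrid_search("patient found on the floor beside the hospital bed"):
    print(hit)
```

Keeping the dense and lexical sub-scores alongside the blended score is what makes such a module auditable: the agent, or a clinician reviewing its output, can see why a piece of evidence was retrieved before it feeds into a reasoning step.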