This paper introduces a lightweight multimodal reasoning framework for clinical scene understanding, addressing limitations of current VLMs in temporal reasoning and structured output generation for robotics. The framework pairs the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer to enable chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation, and it produces structured scene graphs backed by a hybrid retrieval module. Evaluations on Video-MME and a custom clinical dataset demonstrate competitive accuracy and improved robustness compared to existing VLMs, highlighting potential for applications such as robot-assisted surgery and patient monitoring.
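Both building blocks are openly available, so the video-to-scene-graph step can be approximated outside the paper's own pipeline. The sketch below is an assumption about how such a step might look, not the authors' implementation: it prompts Qwen2.5-VL-3B-Instruct through the standard Hugging Face `transformers` interface to emit a JSON scene graph for a short clip; the video path and prompt wording are illustrative placeholders.

```python
# Minimal sketch: prompt Qwen2.5-VL-3B-Instruct for a JSON scene graph.
# Assumes `transformers` and `qwen_vl_utils` are installed; the video path
# and prompt text are illustrative, not taken from the paper.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///data/clip_ward_007.mp4", "fps": 1.0},
        {"type": "text", "text": (
            "Return a JSON scene graph with 'objects' (name, location), "
            "'relations' (subject, predicate, object), and 'actions' observed "
            "in this clinical scene. Output JSON only."
        )},
    ],
}]

# Standard Qwen2.5-VL preprocessing: chat template plus packed vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt", **video_kwargs
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
scene_graph_json = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(scene_graph_json)
```

In the paper's setting, an agent layer would decide when to invoke a step like this as a tool and feed the resulting graph into downstream planning; here it is shown standalone for clarity.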
Clinical robots get a brain boost: a lightweight multimodal agent uses structured scene graphs and interpretable retrieval to match larger VLMs on accuracy, and beat them on robustness, in complex healthcare scenarios.
Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.
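The abstract does not spell out the hybrid retrieval module; a common reading is a blend of lexical and dense (embedding) scoring over a small corpus of clinical knowledge, with both sub-scores kept visible so retrieved evidence stays interpretable. The sketch below illustrates that pattern under those assumptions; the corpus, weighting, and all-MiniLM-L6-v2 encoder are illustrative choices, not the paper's.

```python
# Minimal sketch of a hybrid (lexical + dense) retriever with interpretable
# per-document scores. Assumes sentence-transformers is installed; the corpus,
# weighting, and encoder choice are illustrative, not from the paper.
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Sterile instruments must remain on the draped tray until requested.",
    "A patient lying on the floor next to the bed indicates a possible fall.",
    "IV pumps beep when the line is occluded or the bag is empty.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

def lexical_score(query: str, doc: str) -> float:
    """Fraction of query tokens that also appear in the document (simple overlap)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_search(query: str, alpha: float = 0.5, top_k: int = 2):
    """Blend dense cosine similarity and lexical overlap; expose both sub-scores."""
    q_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    dense = util.cos_sim(q_emb, corpus_emb)[0].tolist()
    scored = []
    for i, doc in enumerate(corpus):
        lex = lexical_score(query, doc)
        scored.append({
            "doc": doc,
            "dense": round(dense[i], 3),
            "lexical": round(lex, 3),
            "score": round(alpha * dense[i] + (1 - alpha) * lex, 3),
        })
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]

for hit in hybrid_search("patient found on the floor beside the hospital bed"):
    print(hit)
```

Keeping the dense and lexical sub-scores alongside the blended score is what makes such a module auditable: the agent, or a clinician reviewing its output, can see why a piece of evidence was retrieved before it feeds into a reasoning step.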