This paper introduces MM-Tramba, a novel medical visual question answering (Med-VQA) model that fuses Transformer and Mamba architectures to address limitations in long-sequence context reasoning and multimodal interaction. The model employs multi-scale visual feature extraction, including an SS2D module for unstructured features, to enrich visual information. Experiments on the VQA-RAD and SLAKE-EN datasets demonstrate MM-Tramba's effectiveness, particularly in handling long sequences, and extend the application of Transformer-Mamba fusion to multimodal domains.
By fusing Transformers and Mamba, MM-Tramba achieves state-of-the-art results in medical visual question answering, especially when reasoning over long sequences of questions.
Medical Visual Question Answering (Med-VQA) aims to analyze medical images (such as CT, MRI, and X-ray) and answer relevant questions raised by users. Since the introduction of the Transformer, deep learning has undergone tremendous changes: Transformers have achieved remarkable success in fields such as natural language processing and computer vision, and have become increasingly common in medical image analysis. Although the Transformer and its variants have proven effective in many domains, the traditional Transformer architecture faces a performance bottleneck in long-sequence context reasoning, because the computational complexity of its attention mechanism grows quadratically with the sequence length N (O(N²)), limiting its scalability. Meanwhile, when the same patient asks multiple questions, those questions can be viewed as a long sequence, and understanding them jointly is necessary to give accurate answers in Med-VQA; Transformer-based methods sometimes fail to do so. Recently, Mamba, a state-space model (SSM), has offered a promising alternative. However, Mamba's limitation is that it focuses on temporal modeling, is not naturally suited to complex multimodal interaction tasks, and is generally less expressive than the Transformer on high-dimensional data. To this end, this paper proposes MM-Tramba, a visual question answering model based on a multi-modal, multi-scale Transformer-Mamba architecture. To address the insufficient use of visual information in existing Med-VQA frameworks, the model performs multi-scale visual feature extraction and extracts unstructured features through an SS2D module to obtain richer visual information; it then fuses modality information with a combined Transformer-Mamba architecture to achieve better performance. Extensive experiments are conducted on the benchmark datasets VQA-RAD [1] and SLAKE-EN [2].
Experimental results show that MM-Tramba is not only a powerful model with excellent performance on long-sequence problems, but also helps extend the Transformer-Mamba fusion architecture to the multimodal domain.
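As background to the complexity contrast that motivates the fusion, the following minimal NumPy sketch compares dense self-attention, whose score matrix is N×N (quadratic in sequence length), with an SSM-style recurrent scan that performs one state update per token (linear in sequence length). This is an illustrative toy example only, not the MM-Tramba implementation; all function and variable names are hypothetical.

```python
import numpy as np

def attention(Q, K, V):
    # Dense self-attention: the N x N score matrix makes
    # compute and memory quadratic in sequence length N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def ssm_scan(A, B, C, x):
    # SSM-style recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    # One constant-cost state update per token -> linear in N.
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # exactly N steps
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

N, d = 8, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
out_attn = attention(Q, K, V)      # shape (N, d)

A = 0.9 * np.eye(2)                # toy 2-dimensional hidden state
B = np.ones(2)
C = np.ones(2)
out_ssm = ssm_scan(A, B, C, rng.standard_normal(N))  # shape (N,)
```

The sketch only illustrates the asymptotic trade-off discussed above: attention materializes pairwise interactions, while the scan compresses history into a fixed-size state, which is why hybrid architectures combine the two.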