The paper introduces the Multimodal Variational Masked Autoencoder (MVMAE), a pre-training framework for Medical VQA designed to improve robustness against adversarial attacks. MVMAE combines masked modeling with variational inference, using a multimodal bottleneck fusion module and reparameterization to extract robust latent representations. Experiments on public medical VQA datasets show that MVMAE significantly improves resistance to adversarial attacks compared with other pre-training methods.
Medical VQA models can be made significantly more robust to adversarial attacks using a novel pre-training approach based on masked autoencoders and variational inference, without requiring additional data or complex procedures.
Medical Visual Question Answering (Medical VQA) plays an important role in medical informatics. However, the robustness of existing medical VQA models is severely challenged by adversarial attacks. Current methods (e.g., adversarial training and noise-based reasoning) rely heavily on additional data or complex procedures and often ignore model-level robustness. To address these issues, we propose the Multimodal Variational Masked Autoencoder (MVMAE), a novel pre-training framework designed to enhance the robustness of medical VQA models. MVMAE leverages masked modeling and variational inference to extract robust multimodal features. The framework introduces a low-cost multimodal bottleneck fusion module and employs reparameterization to sample robust latent representations, ensuring effective feature fusion and reconstruction. Extensive experiments on public medical VQA datasets demonstrate that MVMAE significantly improves resistance to various adversarial attacks and outperforms other medical multimodal pre-training methods.
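To make the bottleneck fusion and reparameterization ideas concrete, here is a minimal sketch of how such a module might be implemented in PyTorch. It is an illustration only: the class name `BottleneckFusionVAE`, the dimensions, the number of bottleneck tokens, and the use of cross-attention are assumptions for the example, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class BottleneckFusionVAE(nn.Module):
    """Hypothetical sketch: fuse image and text tokens through a small set of
    learned bottleneck tokens, then sample a latent via the reparameterization
    trick. Names and dimensions are illustrative, not taken from the paper."""

    def __init__(self, dim=256, num_bottleneck=8, num_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, image_tokens, text_tokens):
        # Concatenate modalities; the bottleneck tokens attend to the joint
        # sequence, forcing cross-modal information through a narrow channel.
        context = torch.cat([image_tokens, text_tokens], dim=1)
        bottleneck = self.bottleneck.expand(context.size(0), -1, -1)
        fused, _ = self.cross_attn(bottleneck, context, context)

        # Variational head: predict mean and log-variance, then sample with
        # the reparameterization trick so gradients flow through the sampling.
        mu, logvar = self.to_mu(fused), self.to_logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

        # KL divergence to a standard normal prior, which could serve as a
        # regularizer alongside a masked reconstruction loss in pre-training.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl


# Toy usage with already-embedded visible image patches and question tokens.
model = BottleneckFusionVAE()
image_tokens = torch.randn(2, 49, 256)   # e.g., visible patch embeddings
text_tokens = torch.randn(2, 20, 256)    # e.g., question token embeddings
latent, kl_loss = model(image_tokens, text_tokens)
print(latent.shape, kl_loss.item())
```

In this sketch the sampled latent `z` would feed a reconstruction decoder for the masked inputs, with the KL term encouraging a smooth latent space; whether MVMAE structures its losses this way is not specified in the abstract.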