This paper reviews the evolution of Visual Question Answering (VQA) systems from CNN-LSTM architectures to modern Multimodal Large Language Model (MLLM)-based approaches like BLIP-2 and LLaVA. It analyzes how LLMs and pre-trained vision encoders are integrated for visual reasoning, highlighting representative architectures and training paradigms. The paper identifies key challenges for MLLMs in VQA, including modality imbalance, cross-modal alignment, and hallucinations, concluding that improvements in alignment and hallucination mitigation are crucial for robust VQA.
MLLMs have revolutionized VQA, but still struggle with visual grounding and balanced multimodal fusion, hindering their reliability.
Multimodal understanding, which requires models to jointly reason over visual and linguistic information, has become a core challenge in artificial intelligence (AI). Visual Question Answering (VQA) stands as a paradigmatic task for investigating these multimodal reasoning capabilities. While early VQA systems relied on task-specific architectures, recent breakthroughs in Multimodal Large Language Models (MLLMs) have significantly reshaped the field by introducing unified, instruction-driven multimodal reasoning frameworks. Through a systematic literature review, this paper examines the evolution of VQA from traditional CNN-LSTM-based models to modern MLLM-based approaches. The review centers on representative architectures and training paradigms, including BLIP-2 and LLaVA, to analyze how large language models and pretrained vision encoders are integrated for flexible and open-ended visual reasoning. In addition, this paper identifies and discusses critical challenges confronting contemporary MLLMs, including modality imbalance, insufficient cross-modal alignment, and hallucinations. This paper concludes that while MLLMs have substantially expanded the application scope and functional capabilities of VQA systems, they still grapple with reliable visual grounding and balanced multimodal fusion. Addressing these limitations is paramount for constructing trustworthy and robust VQA systems, and future research should prioritize improving alignment mechanisms and mitigating hallucinations in multimodal reasoning.