This paper introduces an Explicit Logic Channel (ELC) to validate, select, and enhance Multimodal Large Language Models (MLLMs) in zero-shot Visual-Language Comprehension (VLC) tasks. The ELC mimics human logical reasoning by integrating an LLM, a Visual Factual Module (VFM), and probabilistic inference for factual, counterfactual, and relational reasoning over visual evidence. Experiments on MC-VQA and HC-REC tasks demonstrate that the ELC and a proposed Consistency Rate (CR) effectively validate, select, and improve MLLMs while enhancing explainability and trustworthiness.
An "Explicit Logic Channel" that mimics human reasoning can validate, select, and even improve black-box MLLMs on visual-language tasks, without any ground truth.
Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner, so validating and understanding the behavior of these models becomes important before applying them to a new task. We propose an Explicit Logic Channel (ELC), operating in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection, and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be regarded as an Implicit Logic Channel. The proposed ELC, mimicking human logical reasoning, incorporates an LLM, a VFM, and probabilistic inference for factual, counterfactual, and relational reasoning over explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves zero-shot performance over the MLLMs alone, grounding predictions in explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted on two representative VLC tasks, i.e., MC-VQA and HC-REC, across three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier model families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection, and improvement of MLLMs, with enhanced explainability and trustworthiness.
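The abstract does not give the formal definition of CR or the cross-channel integration rule. As a minimal sketch, assuming CR is the label-free agreement rate between the two channels' predictions and that integration is a simple convex combination of their per-option answer scores, the computation might look as follows; the function names (`consistency_rate`, `fuse_channels`) and the mixing weight `alpha` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def consistency_rate(implicit_preds, explicit_preds):
    """Sketch of the Consistency Rate (CR), read here as the fraction of
    samples on which the black-box MLLM (Implicit Logic Channel) and the
    Explicit Logic Channel return the same answer. No ground-truth labels
    are required, so CR could rank candidate MLLMs on an unlabeled task."""
    implicit_preds = np.asarray(implicit_preds)
    explicit_preds = np.asarray(explicit_preds)
    return float(np.mean(implicit_preds == explicit_preds))

def fuse_channels(p_implicit, p_explicit, alpha=0.5):
    """Sketch of cross-channel integration for one MC-VQA question:
    a convex combination of the two channels' per-option scores.
    (alpha is an assumed mixing weight, not specified in the paper.)"""
    p = alpha * np.asarray(p_implicit) + (1.0 - alpha) * np.asarray(p_explicit)
    return int(np.argmax(p))  # index of the selected answer option

# Example: rank two candidate MLLMs by CR on an unlabeled benchmark.
elc_answers = [0, 2, 1, 3, 0]   # Explicit Logic Channel picks (hypothetical)
mllm_a      = [0, 2, 1, 1, 0]   # candidate model A's picks
mllm_b      = [3, 2, 0, 1, 2]   # candidate model B's picks
print(consistency_rate(mllm_a, elc_answers))  # 0.8 -> prefer model A
print(consistency_rate(mllm_b, elc_answers))  # 0.2
```

Under this reading, a higher CR marks the MLLM whose implicit reasoning most often agrees with the explicit, evidence-grounded channel, which is how label-free model selection could proceed.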