The paper introduces Modality-Decoupled Direct Preference Optimization (MoD-DPO) to address cross-modal hallucinations in omni-modal large language models. MoD-DPO uses modality-aware regularization to enforce invariance to irrelevant modality corruptions and sensitivity to relevant modality perturbations, alongside a language-prior debiasing penalty. Experiments on audiovisual hallucination benchmarks show MoD-DPO improves perception accuracy and hallucination resistance compared to existing preference optimization methods.
Decoupling modalities during preference optimization reduces cross-modal hallucinations in omni LLMs and improves perception accuracy, at a training budget similar to that of existing preference optimization baselines.
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
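As a rough illustration of how the terms described in the abstract could combine, the Python sketch below adds three auxiliary penalties to the standard DPO loss. Only the DPO term follows the established DPO formulation. The invariance, sensitivity, and language-prior penalties, their absolute-value and hinge forms, the weights, and all function and key names are assumptions made for illustration and are not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from sequence log-probs of the chosen (w)
    and rejected (l) responses under the policy and reference models."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def invariance_penalty(logp_clean, logp_irrelevant_corrupted):
    """Assumed form: corrupting a modality the question does not depend on
    should leave the chosen response's log-prob unchanged."""
    return abs(logp_clean - logp_irrelevant_corrupted)


def sensitivity_penalty(logp_clean, logp_relevant_corrupted, margin=1.0):
    """Assumed form: corrupting the relevant modality should lower the
    chosen response's log-prob by at least `margin` (hinge penalty)."""
    return max(0.0, margin - (logp_clean - logp_relevant_corrupted))


def language_prior_penalty(logp_clean, logp_text_only, margin=1.0):
    """Assumed form: the chosen response should be more likely with the
    audiovisual inputs than from the text prompt alone."""
    return max(0.0, margin - (logp_clean - logp_text_only))


def mod_dpo_style_loss(lp, beta=0.1, lam_inv=1.0, lam_sens=1.0, lam_lp=1.0):
    """Weighted sum of the DPO loss and the three hypothetical penalties.
    `lp` maps hypothetical keys to sequence log-probabilities."""
    base = dpo_loss(lp["pol_w"], lp["pol_l"], lp["ref_w"], lp["ref_l"], beta)
    inv = invariance_penalty(lp["pol_w"], lp["pol_w_irrelevant_corrupted"])
    sens = sensitivity_penalty(lp["pol_w"], lp["pol_w_relevant_corrupted"])
    prior = language_prior_penalty(lp["pol_w"], lp["pol_w_text_only"])
    return base + lam_inv * inv + lam_sens * sens + lam_lp * prior


# Made-up log-probabilities for one audio-focused question, for illustration only.
example = {
    "pol_w": -4.0, "pol_l": -6.0, "ref_w": -5.0, "ref_l": -5.0,
    "pol_w_irrelevant_corrupted": -4.2,  # video corrupted, audio intact
    "pol_w_relevant_corrupted": -7.0,    # audio corrupted
    "pol_w_text_only": -6.5,             # no audio or video provided
}
loss = mod_dpo_style_loss(example)
```

In a real training setup each log-probability would come from a forward pass of the policy or reference model on the clean, corrupted, or text-only inputs, and the penalty weights would need tuning.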