The paper addresses object hallucination in Multimodal Large Language Models (MLLMs) by improving visual contrastive decoding (VCD) through the creation of an object-aligned auxiliary view. This auxiliary view is constructed by masking the most salient visual evidence based on object-centric attention from self-supervised Vision Transformers, thereby disrupting unsupported tokens during decoding. The proposed method, "Mask What Matters," is prompt-agnostic, model-agnostic, and computationally efficient, leading to improved performance on object hallucination benchmarks.
Object hallucination in MLLMs can be significantly reduced by masking the most salient visual evidence when constructing the auxiliary view for contrastive decoding.
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
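The two core steps, masking the most attended patches to build the auxiliary view and contrasting the two sets of logits, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the zero-fill masking, and the saliency source (e.g., CLS-token attention from a self-supervised ViT such as DINO) are assumptions, and the logit adjustment follows the standard VCD form.

```python
import numpy as np

def mask_salient_patches(image_patches, attention, mask_ratio=0.3, fill_value=0.0):
    """Build the auxiliary view by removing the most salient visual evidence.

    image_patches: (N, D) array of patch features (hypothetical representation).
    attention: (N,) per-patch saliency scores, assumed to come from an
        object-centric attention map of a self-supervised ViT.
    Returns the masked copy and the indices that were masked.
    """
    n_mask = max(1, int(len(attention) * mask_ratio))
    top_idx = np.argsort(attention)[-n_mask:]   # most attended patches
    masked = image_patches.copy()
    masked[top_idx] = fill_value                # drop the key object evidence
    return masked, top_idx

def contrastive_decode(logits_orig, logits_aux, alpha=1.0):
    """Standard VCD adjustment of next-token logits:
    (1 + alpha) * logits(original view) - alpha * logits(auxiliary view).
    Tokens still supported without the salient evidence are penalized."""
    return (1.0 + alpha) * logits_orig - alpha * logits_aux
```

Because the auxiliary view is fixed per image, its forward pass can be run once and cached, which is where the low overhead comes from.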