This paper introduces Saliency-R1, a framework that enhances the interpretability and faithfulness of vision-language models by aligning model-generated saliency maps with human-annotated bounding boxes during training. It uses a novel saliency map technique to highlight the image regions that contribute to generated tokens, without additional computational overhead. The overlap between saliency maps and human annotations serves as the reward function within a Group Relative Policy Optimization (GRPO) framework, so Saliency-R1 encourages the model to focus on relevant visual areas during reasoning, improving faithfulness, interpretability, and overall task performance.
Force your VLMs to *show their work*: Saliency-R1 aligns model saliency maps with human-annotated bounding boxes, boosting faithfulness and interpretability, and computing the saliency maps adds no computational overhead.
Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding their tendency to lean more on textual cues than on visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This technique can be further extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient regions with the annotated critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show that Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
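The reward described above can be illustrated with a small sketch. The abstract does not give the exact overlap formula, so this example assumes one plausible choice: the fraction of total saliency mass that falls inside the annotated boxes. The function names `saliency_overlap_reward` and `group_relative_advantages`, the box format, and the toy maps are all hypothetical and are not taken from the paper. The second function shows the standard group-relative normalization used in GRPO, where each sampled response's reward is compared against the mean and standard deviation of its group.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixel coordinates, half-open.
Box = Tuple[int, int, int, int]


def saliency_overlap_reward(
    saliency: Sequence[Sequence[float]], boxes: Sequence[Box]
) -> float:
    """Fraction of total saliency mass inside any annotated box.

    This is an assumed overlap measure, not necessarily the paper's.
    """
    total = 0.0
    inside = 0.0
    for y, row in enumerate(saliency):
        for x, value in enumerate(row):
            total += value
            if any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in boxes):
                inside += value
    # An all-zero map carries no evidence, so it earns no reward.
    return inside / total if total > 0 else 0.0


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy 2x2 saliency maps for three sampled responses to the same image.
annotated_boxes = [(1, 1, 2, 2)]        # covers only the bottom-right pixel
focused = [[0.0, 0.0], [0.0, 1.0]]      # all saliency inside the box
diffuse = [[0.25, 0.25], [0.25, 0.25]]  # saliency spread evenly
off_target = [[1.0, 0.0], [0.0, 0.0]]   # all saliency outside the box

rewards = [
    saliency_overlap_reward(m, annotated_boxes)
    for m in (focused, diffuse, off_target)
]
advantages = group_relative_advantages(rewards)
```

In this toy group, the response whose saliency lies inside the annotated box receives a positive advantage, and the other two receive negative ones, which is the signal that would push the policy toward visually grounded reasoning. A real training setup would likely combine this term with a task-correctness reward; how the paper weights the two is not stated in the abstract.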