CAS | Feb 25, 2026 | arXiv:2602.21655

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang

AI Summary

The paper introduces CCCaption, a dual-reward reinforcement learning framework designed to optimize image captioning models for completeness and correctness, addressing the limitations of human-annotated ground truths. Completeness is encouraged by rewarding captions that answer visual queries extracted from the image using diverse LVLMs, while correctness is enforced by penalizing hallucinations through authenticity validation of sub-caption queries. Experiments on standard captioning benchmarks demonstrate that CCCaption improves caption quality by moving beyond imitation of potentially flawed human annotations.

Key Contribution

Ditch imperfect human annotations: this dual-reward RL approach trains image captioning models to be both more complete and more factually correct.

Abstract

Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived by decomposing the caption. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
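To make the dual-reward idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the reward formulas, so the sketch assumes completeness is the fraction of visual queries a caption answers, correctness is the fraction of sub-caption claims that pass validation, and the two are combined with equal weights. The names `completeness_reward`, `correctness_reward`, `dual_reward`, and the toy judge functions are hypothetical; in practice the judges would be LVLM calls, and dynamic query sampling is omitted.

```python
from typing import Callable, Sequence


def completeness_reward(
    caption: str,
    queries: Sequence[str],
    answers: Callable[[str, str], bool],
) -> float:
    """Fraction of visual queries the caption answers (assumed form)."""
    if not queries:
        return 0.0
    return sum(1 for q in queries if answers(caption, q)) / len(queries)


def correctness_reward(
    sub_claims: Sequence[str],
    is_true: Callable[[str], bool],
) -> float:
    """Fraction of sub-caption claims that pass validation (assumed form)."""
    if not sub_claims:
        return 0.0
    return sum(1 for c in sub_claims if is_true(c)) / len(sub_claims)


def dual_reward(completeness: float, correctness: float) -> float:
    """Equal-weight combination; the weighting is an assumption."""
    return 0.5 * completeness + 0.5 * correctness


# Toy stand-ins for the LVLM judges (hypothetical, keyword-based).
caption = "a brown dog runs on the beach next to a red ball"
queries = ["dog", "beach", "ball", "sky"]
sub_claims = ["a brown dog", "on the beach", "a red ball"]
verified_facts = {"a brown dog", "on the beach"}  # the ball is not red


def toy_answers(cap: str, query: str) -> bool:
    # A query counts as answered if its keyword appears in the caption.
    return query in cap


def toy_is_true(claim: str) -> bool:
    # A claim counts as correct if it is in the verified fact set.
    return claim in verified_facts


comp = completeness_reward(caption, queries, toy_answers)
corr = correctness_reward(sub_claims, toy_is_true)
reward = dual_reward(comp, corr)
```

In this toy case the caption answers three of four queries and two of three claims are verified, so a caption that adds more detail only gains reward if the added detail is also true.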

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
