ERNIE Team | May 26, 2026 | arXiv:2605.26967

CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng

AI Summary

The paper introduces CodecCap, a dense video captioning framework inspired by video codecs that represents videos with keyframe and residual captions to balance visual fidelity and redundancy. The authors also introduce VidCapQA, a caption-then-QA benchmark that quantifies caption fidelity and reveals that captions generated directly by VLMs miss many visual details. Experiments show that CodecCap significantly outperforms direct captioning, and the framework is used to create CodecVDC-100K, a large-scale dense captioning dataset.

Key Contribution

Keyframe-residual captioning unlocks high-fidelity video-language supervision, surpassing direct VLM captioning in capturing fine-grained visual details.

Abstract

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a promising route to high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.
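The keyframe-plus-residual representation described in the abstract can be pictured as a small data structure. The following is a minimal sketch only, not the authors' implementation: the class names, field names, and the `render_dense_caption` helper are all hypothetical, and the abstract does not specify how keyframes are selected or how captions are generated.

```python
from dataclasses import dataclass, field


@dataclass
class ResidualCaption:
    """Hypothetical schema: describes only what changes in a short time window."""
    start_s: float
    end_s: float
    text: str


@dataclass
class KeyframeSegment:
    """Hypothetical schema: one keyframe caption plus the residuals that follow it."""
    keyframe_time_s: float
    keyframe_caption: str  # exhaustive description of the stable visual context
    residuals: list[ResidualCaption] = field(default_factory=list)


def render_dense_caption(segments: list[KeyframeSegment]) -> str:
    """Flatten segments into one timestamped caption, ordered by time.

    Stable context is written once per keyframe; residual lines carry only
    the localized actions, motions, and changes, so the context is not repeated.
    """
    lines = []
    for seg in sorted(segments, key=lambda s: s.keyframe_time_s):
        lines.append(f"[{seg.keyframe_time_s:.1f}s] KEYFRAME: {seg.keyframe_caption}")
        for res in sorted(seg.residuals, key=lambda r: r.start_s):
            lines.append(f"[{res.start_s:.1f}s-{res.end_s:.1f}s] RESIDUAL: {res.text}")
    return "\n".join(lines)
```

Under these assumptions, a segment-wise captioner would restate the scene in every window, whereas here each `KeyframeSegment` states it once and its `ResidualCaption` entries record only the differences, which is the codec analogy the abstract draws.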

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
