The paper introduces CodecCap, a dense video captioning framework inspired by video codecs that represents videos with keyframe and residual captions to balance visual fidelity against redundancy. The authors also introduce VidCapQA, a caption-then-QA benchmark for quantifying caption fidelity, which reveals that captions generated directly by VLMs miss many visual details. Experiments show that CodecCap significantly outperforms direct captioning, and the framework is used to create CodecVDC-100K, a large-scale dense captioning dataset.
Keyframe-residual captioning unlocks high-fidelity video-language supervision, surpassing direct VLM captioning in capturing fine-grained visual details.
Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a promising route to high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.
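To make the keyframe-residual idea concrete, here is a minimal Python sketch of how such a caption could be stored and rendered. It is an illustration only, not the CodecCap implementation: the class names (`KeyframeCaption`, `ResidualCaption`, `CodecStyleCaption`), their fields, the `render` output format, and the sample captions are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class KeyframeCaption:
    # Hypothetical structure: a full description of the stable visual context.
    time_s: float
    text: str


@dataclass
class ResidualCaption:
    # Hypothetical structure: only the localized change within a time span.
    start_s: float
    end_s: float
    text: str


@dataclass
class CodecStyleCaption:
    keyframes: list[KeyframeCaption] = field(default_factory=list)
    residuals: list[ResidualCaption] = field(default_factory=list)

    def render(self) -> str:
        """Interleave keyframe and residual captions in temporal order."""
        events = [
            (k.time_s, 0, f"[{k.time_s:.1f}s] KEYFRAME: {k.text}")
            for k in self.keyframes
        ]
        events += [
            (r.start_s, 1, f"[{r.start_s:.1f}s-{r.end_s:.1f}s] RESIDUAL: {r.text}")
            for r in self.residuals
        ]
        # Sort by time; at equal times a keyframe precedes its residuals.
        return "\n".join(line for _, _, line in sorted(events))


# Made-up example content, for illustration only.
caption = CodecStyleCaption(
    keyframes=[
        KeyframeCaption(0.0, "A kitchen with a wooden table; a red mug sits on the left."),
    ],
    residuals=[
        ResidualCaption(1.5, 3.0, "A hand lifts the red mug."),
        ResidualCaption(0.5, 1.0, "A hand enters from the right."),
    ],
)
rendered = caption.render()
print(rendered)
```

In this sketch the static scene is described once in the keyframe caption, and each residual caption states only what changes, which mirrors the trade-off between fidelity and redundancy described in the abstract.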