The paper introduces ViSIL, an information-theoretic metric for evaluating information loss in multimodal video captioning by quantifying the video information not captured by a summary. ViSIL leverages vision-language model (VLM) inference to enable direct comparison across multimodal summary formats, addressing the limitations of traditional metrics like BLEU or ROUGE. Experiments demonstrate that ViSIL correlates with human and VLM performance on VQA tasks and facilitates summary selection to optimize the trade-off between information loss and processing speed, achieving a 7% improvement in VQA accuracy compared to text summaries.
Ditch BLEU and ROUGE: ViSIL offers a unified metric for multimodal video captioning that actually correlates with VQA performance and human judgment by measuring information loss via VLM inference.
Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. Because it measures information loss directly, ViSIL provides a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection that optimizes the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.
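The summary-selection idea in the abstract can be sketched with a generic Pareto-frontier filter. This is a minimal illustration, not the paper's implementation: the candidate formats and their (information loss, latency) values below are hypothetical placeholders, and in practice the loss column would come from the ViSIL score.

```python
def pareto_frontier(summaries):
    """Keep only non-dominated candidates: a summary is dominated if
    another one has information loss and processing time that are both
    no worse, and at least one strictly better.

    Each candidate is a (name, info_loss, proc_time) tuple.
    """
    frontier = []
    for name, loss, time in summaries:
        dominated = any(
            l2 <= loss and t2 <= time and (l2 < loss or t2 < time)
            for _, l2, t2 in summaries
        )
        if not dominated:
            frontier.append((name, loss, time))
    return frontier

# Hypothetical measurements for three summary formats
candidates = [
    ("text-only",      0.42, 1.0),  # fast but lossy
    ("keyframes-only", 0.35, 2.5),  # dominated by text+keyframes below
    ("text+keyframes", 0.20, 2.0),  # lower loss at lower cost than keyframes-only
]
print(pareto_frontier(candidates))
# "keyframes-only" is dropped; the other two trade loss against speed
```

Given such a frontier, a deployment can pick the cheapest summary format whose information loss stays under an application-specific budget.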