Feb 13, 2026 | arXiv:2602.13013

Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li, Hengrui Zhang, Wenzhao Gao, Shaoyong Jia, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

AI Summary

The authors address the limitations of existing video-instruction data by introducing ASID-1M, a dataset of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision, and ASID-Verify, a data curation pipeline for annotation with automatic verification and refinement. They then train ASID-Captioner, a video understanding model, via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments demonstrate that ASID-Captioner improves fine-grained caption quality, reduces hallucinations, and enhances instruction following, achieving state-of-the-art performance among open-source models.

Key Contribution

Forget incomplete video descriptions: a new dataset and data curation pipeline yield a video captioning model that is competitive with Gemini-3-Pro while reducing hallucinations.

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.
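The abstract does not specify how ASID-Verify is implemented, so the following is only a minimal sketch of what a generic verify-then-refine loop over attribute-structured captions could look like. Every name here (AttributeCaption, VideoSample, verify_and_refine, clamp_span, the attribute labels, and the retry budget) is a hypothetical illustration, not the ASID-1M schema or the authors' pipeline. The semantic check is a stand-in callable; in practice it would presumably be a model-based judgment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AttributeCaption:
    """One attribute-level description (hypothetical schema, not the ASID-1M format)."""
    attribute: str  # illustrative labels such as "speech" or "camera"
    text: str
    start: float    # span start, in seconds
    end: float      # span end, in seconds


@dataclass
class VideoSample:
    video_id: str
    duration: float  # video length, in seconds
    captions: List[AttributeCaption] = field(default_factory=list)


def temporally_consistent(cap: AttributeCaption, duration: float) -> bool:
    """A caption's span must be non-empty and lie inside the video."""
    return 0.0 <= cap.start < cap.end <= duration


def verify_and_refine(
    sample: VideoSample,
    semantic_check: Callable[[VideoSample, AttributeCaption], bool],
    refine: Callable[[VideoSample, AttributeCaption], Optional[AttributeCaption]],
    max_rounds: int = 2,  # assumed retry budget, chosen for illustration
) -> VideoSample:
    """Keep captions that pass both checks; try to repair the rest, else drop them."""
    kept: List[AttributeCaption] = []
    for cap in sample.captions:
        current: Optional[AttributeCaption] = cap
        for attempt in range(max_rounds + 1):
            if temporally_consistent(current, sample.duration) and semantic_check(sample, current):
                kept.append(current)
                break
            if attempt == max_rounds:
                break  # refinement budget exhausted: drop the caption
            current = refine(sample, current)
            if current is None:
                break  # refiner could not repair the caption: drop it
    return VideoSample(sample.video_id, sample.duration, kept)


def clamp_span(sample: VideoSample, cap: AttributeCaption) -> Optional[AttributeCaption]:
    """Toy refiner: clip the span to the video; give up if nothing remains."""
    start = max(0.0, cap.start)
    end = min(sample.duration, cap.end)
    if start >= end:
        return None
    return AttributeCaption(cap.attribute, cap.text, start, end)


# Toy data: a 10-second video with four candidate captions.
sample = VideoSample("demo", 10.0, [
    AttributeCaption("speech", "A woman greets the audience.", 0.5, 3.0),
    AttributeCaption("camera", "Slow pan to the left.", 8.0, 12.0),  # overruns the video
    AttributeCaption("scene", "", 1.0, 2.0),                          # empty text
    AttributeCaption("audio", "Applause.", 11.0, 12.0),               # entirely out of range
])

# Stand-in semantic check: the description must be non-empty.
cleaned = verify_and_refine(sample, lambda s, c: bool(c.text.strip()), clamp_span)
print([(c.attribute, c.start, c.end) for c in cleaned.captions])
```

In this toy run the "camera" caption is repaired by clipping its span to the video length, while the empty "scene" caption and the out-of-range "audio" caption are dropped. Passing the checker and refiner as callables keeps the loop independent of whichever models perform verification.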

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
