This paper details APRVOS, a novel pipeline for Audio-aware Referring Video Object Segmentation (Ref-VOS) designed for spoken referring expressions. The system incorporates speech transcription via VibeVoice-ASR and visual existence verification using an Omni-based module to handle noisy audio-derived queries. It then uses Sa2VA for initial segmentation, followed by an agentic refinement layer leveraging SAM3 to improve spatial and temporal precision.
By explicitly verifying the visual existence of spoken references before segmentation, APRVOS substantially improves robustness in noisy audio-conditioned Ref-VOS, outperforming standard pipelines.
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the proposed system introduces two additional front-end stages: speech transcription and visual existence verification. Specifically, we first employ VibeVoice-ASR to convert long-form spoken input into a structured textual transcript. Since audio-derived queries are inherently noisy and may describe entities that are not visually present in the video, we then introduce an Omni-based judgment module to determine whether the transcribed target can be grounded in the visual content. If the target is judged to be absent, the pipeline terminates early and outputs all-zero masks. Otherwise, the transcript is transformed into a segmentation-oriented prompt and fed into Sa2VA to obtain a coarse mask trajectory over the full video. Importantly, this trajectory is treated as an initial semantic hypothesis rather than a final prediction. On top of it, an agentic refinement layer evaluates query reliability, temporal relevance, anchor quality, and potential error sources, and may invoke SAM3 to improve spatial boundary precision and temporal consistency. The resulting framework explicitly decomposes the MEVIS_Audio task into audio-to-text conversion, visual existence verification, coarse video segmentation, and agent-guided refinement. Such a staged design is substantially more appropriate for audio-conditioned Ref-VOS than directly sending noisy ASR outputs into a segmentation model.
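The staged decomposition above can be sketched as a small orchestration function. This is a minimal illustrative sketch, not the authors' implementation: the stage callables (`transcribe`, `verify_existence`, `segment_coarse`, `refine`) are hypothetical stand-ins for VibeVoice-ASR, the Omni-based judge, Sa2VA, and the SAM3-backed refinement layer, whose real APIs are not specified in the report, and `0` is used as a placeholder for an all-zero mask.

```python
from typing import Callable, List, Sequence

def run_aprvos(
    audio: object,
    frames: Sequence,
    transcribe: Callable[[object], str],
    verify_existence: Callable[[str, Sequence], bool],
    segment_coarse: Callable[[str, Sequence], List],
    refine: Callable[[List, Sequence], List],
) -> List:
    """Staged audio-conditioned Ref-VOS:
    ASR -> existence check -> coarse segmentation -> refinement.
    All stage callables are hypothetical stand-ins for the real models.
    """
    # Stage 1: speech transcription (stands in for VibeVoice-ASR).
    transcript = transcribe(audio)

    # Stage 2: visual existence verification (stands in for the Omni judge).
    # Early exit: if the spoken target is absent, emit all-zero masks.
    if not verify_existence(transcript, frames):
        return [0] * len(frames)  # 0 = placeholder for an all-zero mask

    # Stage 3: coarse segmentation from a segmentation-oriented prompt
    # (stands in for Sa2VA); the result is a hypothesis, not final.
    prompt = f"Segment the object described as: {transcript}"
    coarse = segment_coarse(prompt, frames)

    # Stage 4: agent-guided refinement (may invoke SAM3 internally).
    return refine(coarse, frames)
```

The value of this structure is that the early exit in stage 2 prevents noisy, ungroundable ASR output from ever reaching the segmentation model, which is the key difference from feeding transcripts into Sa2VA directly.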