The paper introduces CineSRD, a multimodal framework for speaker diarization in open-world visual media, leveraging visual anchor clustering for initial speaker registration and an audio language model for speaker turn detection and refinement. This approach addresses challenges like long-form video, numerous speakers, and audiovisual asynchrony. The authors also contribute a new speaker diarization benchmark for visual media in both Chinese and English.
Speaker diarization in movies and TV shows just got a whole lot better, thanks to a new multimodal framework that uses visual cues, speech, and subtitles to handle the chaos of open-world video.
Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers, then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
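The two-stage structure described above (visual anchor clustering to register speakers, then assignment of speech turns with a fallback for unregistered off-screen speakers) can be sketched in miniature. This is a hypothetical illustration only: the paper's models are not public, so greedy cosine-similarity clustering stands in for the visual anchor clustering, plain embedding matching stands in for the audio language model, and all function names and thresholds are invented.

```python
# Hypothetical sketch of a CineSRD-style two-stage pipeline.
# Stage 1 registers speakers from face embeddings; stage 2 assigns
# speech-turn embeddings to registered anchors or flags them as
# unregistered off-screen speakers. Thresholds are illustrative.
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def register_speakers(face_embeddings, threshold=0.8):
    """Stage 1: greedy visual anchor clustering.

    Each cluster centroid becomes one registered speaker anchor;
    embeddings below the similarity threshold open a new anchor.
    """
    anchors = []  # list of (centroid, member_count)
    labels = []
    for emb in face_embeddings:
        best, best_sim = None, threshold
        for i, (centroid, _) in enumerate(anchors):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            anchors.append((list(emb), 1))
            labels.append(len(anchors) - 1)
        else:
            centroid, n = anchors[best]
            # Running-mean update of the cluster centroid.
            new_c = [(c * n + e) / (n + 1) for c, e in zip(centroid, emb)]
            anchors[best] = (new_c, n + 1)
            labels.append(best)
    return [c for c, _ in anchors], labels


def diarize(turn_embeddings, anchors, threshold=0.8):
    """Stage 2: assign each speech turn to a registered anchor, or
    mark it as an unregistered off-screen speaker."""
    assignments = []
    for emb in turn_embeddings:
        sims = [cosine(emb, a) for a in anchors]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        if best is not None and sims[best] >= threshold:
            assignments.append(f"speaker_{best}")
        else:
            assignments.append("off_screen")
    return assignments
```

In the actual framework the second stage is an audio language model that also consumes subtitles for turn detection and refinement; the cosine matching here is only a placeholder for that step.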