Sookmyung Women’s University · Yonsei · Apr 16, 2026 · arXiv:2604.14630

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

I. Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Jungho Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee

AI Summary

This paper introduces Cross-Modal Token Modulation (CMTM), a novel approach for unsupervised video object segmentation that strengthens the interaction between appearance and motion cues using a two-stream architecture. CMTM establishes dense connections between tokens from each modality via relation transformer blocks and incorporates a token masking strategy to improve learning efficiency. The proposed method achieves state-of-the-art performance on public benchmarks, demonstrating the effectiveness of cross-modal interaction for this task.

Key Contribution

CMTM improves unsupervised video object segmentation by densely connecting appearance and motion tokens through relation transformer blocks and adding a token masking strategy, reaching state-of-the-art results on public benchmarks.

Abstract

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
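The token-level interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, randomly initialized projection weights, and a masking scheme in which masked tokens are simply dropped from the keys and values. The function name `joint_attention` and the `keep_ratio` parameter are hypothetical. Because attention runs over the concatenated token set, every token can attend to tokens of its own modality (intra-modal) and of the other modality (inter-modal).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(app_tokens, mot_tokens, keep_ratio=1.0, rng=None):
    """Single-head attention over concatenated appearance and motion tokens.

    Hypothetical sketch: weights are random and masking is an assumption,
    not the scheme used in the paper.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_app, dim = app_tokens.shape

    # Dense connections: put both modalities into one token set.
    tokens = np.concatenate([app_tokens, mot_tokens], axis=0)
    n_total = tokens.shape[0]

    # Assumed token masking: keep a random subset as keys/values.
    n_keep = max(1, int(round(keep_ratio * n_total)))
    keep = np.sort(rng.choice(n_total, size=n_keep, replace=False))

    # Random projections stand in for learned query/key/value weights.
    w_q, w_k, w_v = (
        rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
    )
    q = tokens @ w_q
    k = tokens[keep] @ w_k
    v = tokens[keep] @ w_v

    # Each row is a distribution over the kept tokens of both modalities.
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)

    # Residual update, then split back into the two streams.
    out = tokens + attn @ v
    return out[:n_app], out[n_app:], attn
```

With `keep_ratio=1.0` the sketch reduces to plain self-attention over both modalities; lower values shrink the key/value set, which is one simple way to picture how masking could limit what each token is allowed to see during training.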

Computer Vision · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
