This paper introduces cross-modality token modulation (CMTM), an approach to unsupervised video object segmentation that strengthens the interaction between appearance and motion cues in a two-stream architecture. CMTM establishes dense connections between tokens from each modality via relation transformer blocks, which propagate information both within and across modalities, and incorporates a token masking strategy to improve learning efficiency. The authors report state-of-the-art performance on public benchmarks, which they attribute to the stronger cross-modal interaction.
Unsupervised video object segmentation gets a boost from CMTM, a new method that fuses appearance and motion cues using relation transformer blocks and token masking to achieve state-of-the-art results.
Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
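The idea of densely connecting appearance and motion tokens, combined with token masking, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the block design, the masking ratio, or how masked tokens are handled. The sketch assumes a single-head self-attention layer over the concatenated tokens of both modalities, random projections in place of learned weights, and masking implemented by dropping a random subset of tokens. The names `random_token_mask` and `joint_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def random_token_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array in which True marks the tokens that are kept.

    Assumption: tokens are dropped uniformly at random.
    """
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    keep = np.zeros(num_tokens, dtype=bool)
    keep[rng.choice(num_tokens, size=num_keep, replace=False)] = True
    return keep


def joint_attention(appearance, motion, keep_a, keep_m, rng):
    """Single-head self-attention over the kept tokens of both modalities.

    Concatenating the two token sets lets every token attend to every other
    token, so one attention map covers intra-modal and inter-modal relations.
    """
    tokens = np.concatenate([appearance[keep_a], motion[keep_m]], axis=0)  # (N, D)
    dim = tokens.shape[1]
    # Random projections stand in for learned query/key/value weights.
    w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(dim))  # (N, N), each row sums to 1
    out = tokens + attn @ v  # residual connection
    n_a = int(keep_a.sum())
    return out[:n_a], out[n_a:], attn


rng = np.random.default_rng(0)
appearance = rng.standard_normal((16, 8))  # 16 appearance tokens, 8 channels
motion = rng.standard_normal((16, 8))      # 16 motion tokens, 8 channels
keep_a = random_token_mask(16, 0.5, rng)
keep_m = random_token_mask(16, 0.5, rng)
out_a, out_m, attn = joint_attention(appearance, motion, keep_a, keep_m, rng)
print(out_a.shape, out_m.shape, attn.shape)  # (8, 8) (8, 8) (16, 16)
```

In this sketch, the off-diagonal blocks of `attn` hold the appearance-to-motion and motion-to-appearance weights, while the diagonal blocks hold the within-modality weights. Dropping half of the tokens also shrinks the attention map from 32 × 32 to 16 × 16, which is one plausible way masking could reduce training cost.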