This paper tackles the problem of cross-view object correspondence in videos, specifically the egocentric-to-exocentric and exocentric-to-egocentric scenarios. The authors propose a framework based on conditional binary segmentation, in which an object query mask is encoded into a latent representation to localize the corresponding object in a target view. A cycle-consistency training objective, which projects the predicted mask back to the source view, provides a self-supervisory signal. Combined with test-time training, this yields state-of-the-art performance on the Ego-Exo4D and HANDAL-X datasets.
Cycle consistency unlocks SOTA cross-view object correspondence in videos without ground-truth annotations, even enabling test-time training.
We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
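The cycle-consistency objective described above can be sketched in a few lines. The snippet below is a toy illustration, not the authors' implementation: `predict_mask` and `project_back` are hypothetical stand-ins for the paper's conditional segmentation model (source-to-target) and its back-projection (target-to-source), here replaced by trivial invertible shifts so the loss wiring can be run end to end. The reconstruction is scored with binary cross-entropy against the original query mask, one plausible choice for a binary-mask reconstruction loss.

```python
import numpy as np

def predict_mask(query_mask, target_view=None):
    """Toy stand-in for the conditional segmentation network (source -> target).

    A fixed horizontal shift plays the role of the learned cross-view mapping.
    """
    return np.roll(query_mask, shift=1, axis=1)

def project_back(pred_mask, source_view=None):
    """Toy stand-in for the inverse mapping (target -> source)."""
    return np.roll(pred_mask, shift=-1, axis=1)

def cycle_consistency_loss(query_mask, source_view=None, target_view=None):
    """Binary cross-entropy between the reconstructed and original query mask."""
    pred_tgt = predict_mask(query_mask, target_view)    # localize in target view
    recon_src = project_back(pred_tgt, source_view)     # project back to source
    eps = 1e-7                                          # numerical stability
    p = np.clip(recon_src, eps, 1.0 - eps)
    bce = -(query_mask * np.log(p) + (1.0 - query_mask) * np.log(1.0 - p))
    return bce.mean()

# A small square query mask; the toy maps invert each other exactly,
# so the cycle loss should be near zero.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
loss = cycle_consistency_loss(mask)
print(loss)
```

In the actual method, both mappings are produced by the learned model, so this loss requires no ground-truth target-view annotations and can also be minimized at inference time, which is what enables the test-time training strategy.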