KAUST · Mar 4, 2026 · arXiv:2603.04090

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacon, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang

AI Summary

EgoPoseFormer v2 tackles egocentric human motion estimation for AR/VR with a transformer-based model that uses identity-conditioned queries, multi-view spatial refinement, and causal temporal attention. To overcome the scarcity of labeled data, an auto-labeling system applies uncertainty-aware semi-supervised training to large unlabeled datasets. On the EgoBody3M benchmark, the method outperforms two state-of-the-art methods, improving accuracy by up to 19.4% and reducing temporal jitter by up to 51.7%, at a latency of 0.8 ms on GPU.
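As an illustration of what "causal temporal attention" means in general, the sketch below builds a causal mask and applies a masked softmax so that each frame attends only to itself and earlier frames. This is a generic, minimal example and not the paper's implementation; the function names `causal_mask` and `causal_softmax` and the use of raw score matrices are assumptions made for illustration.

```python
import math


def causal_mask(num_frames):
    """Return a num_frames x num_frames boolean matrix.

    Entry [i][j] is True when frame i may attend to frame j,
    i.e. when j is the current frame or an earlier one.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def causal_softmax(scores):
    """Row-wise softmax over attention scores, restricted by the causal mask.

    `scores` is a square list of lists (hypothetical raw attention logits).
    Future frames receive exactly zero weight.
    """
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        visible = [scores[i][j] for j in range(n) if mask[i][j]]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [
            math.exp(scores[i][j] - peak) if mask[i][j] else 0.0
            for j in range(n)
        ]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 3 for _ in range(3)]
    for row in causal_softmax(uniform_scores):
        print(row)
```

Because no frame depends on future frames, a model built this way can run online, frame by frame, which is what a streaming AR/VR setting needs.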

Key Contribution

Achieves real-time egocentric motion capture with up to 19.4% better accuracy and up to 51.7% less temporal jitter than prior state-of-the-art methods, using a transformer architecture and uncertainty-aware semi-supervised training on tens of millions of unlabeled frames.

Abstract

Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with a 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. Furthermore, our auto-labeling system improves the wrist MPJPE by 13.1%.
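For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below shows the standard definition together with a relative-reduction helper of the kind that percentage figures like those above are usually based on. The function names and the toy coordinates are illustrative assumptions, and the paper's exact evaluation protocol (alignment, units, joint subsets) is not reproduced here.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    `pred` and `gt` are equal-length sequences of (x, y, z) joint positions.
    Returns the mean Euclidean distance between matching joints.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


if __name__ == "__main__":
    # Toy example with two joints (values are made up for illustration).
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
    print(mpjpe(predicted, ground_truth))
    print(relative_reduction(50.0, 40.0))
```

Under this reading, a 13.1% improvement in wrist MPJPE would mean the error computed over the wrist joints drops by 13.1% relative to the model trained without auto-labeling.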

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Data Curation & Synthetic Data

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
