This paper introduces LA-Pose, a method that leverages latent action representations learned via inverse- and forward-dynamics models from unlabeled driving videos for camera pose estimation. By repurposing these latent action features as inputs to a pose estimator and finetuning on limited labeled data, LA-Pose achieves state-of-the-art pose accuracy with significantly less supervision. Experiments on Waymo and PandaSet demonstrate over 10% improvement in pose accuracy compared to existing feed-forward methods.
Self-supervised learning from driving videos can beat fully supervised methods for camera pose estimation, even with orders of magnitude less labeled data.
This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations from large-scale driving videos, similar to Genie. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning for world models or as proxies for robot action parameters in policy networks. Our method, dubbed LA-Pose, instead repurposes the latent action features as inputs to a camera pose estimator, which is finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose performs competitively with, and in some cases better than, state-of-the-art methods while using orders of magnitude less labeled data. In particular, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.
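The two-stage recipe described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: single linear layers stand in for the real networks, frames are assumed to be pre-flattened feature vectors, no training loop is shown, and every name and dimension below (`FRAME_DIM`, `LATENT_DIM`, `latent_action`, `predict_pose`, and so on) is hypothetical. It only shows how the pieces connect: an inverse-dynamics model infers a latent action from a frame pair, a forward-dynamics model uses that latent to predict the next frame (the label-free pretraining signal), and a pose head reads the same latent to regress relative camera pose during finetuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FRAME_DIM = 32   # flattened per-frame feature size
LATENT_DIM = 8   # latent action size
POSE_DIM = 6     # 3 translation + 3 rotation parameters


def init_linear(in_dim, out_dim):
    """Small random weight matrix for a bias-free linear layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim))


# Inverse-dynamics model: (frame_t, frame_next) -> latent action.
W_idm = init_linear(2 * FRAME_DIM, LATENT_DIM)
# Forward-dynamics model: (frame_t, latent action) -> predicted frame_next.
W_fdm = init_linear(FRAME_DIM + LATENT_DIM, FRAME_DIM)
# Pose head: latent action -> relative camera pose.
W_pose = init_linear(LATENT_DIM, POSE_DIM)


def latent_action(frame_t, frame_next):
    """Infer a latent action from two consecutive frames."""
    pair = np.concatenate([frame_t, frame_next], axis=-1)
    return np.tanh(pair @ W_idm)


def predict_next(frame_t, z):
    """Predict the next frame from the current frame and a latent action."""
    return np.concatenate([frame_t, z], axis=-1) @ W_fdm


def pretrain_loss(frame_t, frame_next):
    """Stage 1 objective: next-frame reconstruction, no pose labels needed."""
    z = latent_action(frame_t, frame_next)
    return float(np.mean((predict_next(frame_t, z) - frame_next) ** 2))


def predict_pose(frame_t, frame_next):
    """Feed the latent action to the pose head instead of a world model."""
    return latent_action(frame_t, frame_next) @ W_pose


def finetune_loss(frame_t, frame_next, pose_gt):
    """Stage 2 objective: pose regression on a small labeled set."""
    return float(np.mean((predict_pose(frame_t, frame_next) - pose_gt) ** 2))


# Usage with random stand-in data (batch of 4 frame pairs).
frames_t = rng.normal(size=(4, FRAME_DIM))
frames_next = rng.normal(size=(4, FRAME_DIM))
poses_gt = rng.normal(size=(4, POSE_DIM))

print(pretrain_loss(frames_t, frames_next))
print(finetune_loss(frames_t, frames_next, poses_gt))
print(predict_pose(frames_t, frames_next).shape)  # (4, 6)
```

In this sketch the pretraining loss never touches `poses_gt`, which is the property that lets the first stage scale to unlabeled video; only the second stage consumes pose annotations.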