Örebro | Schindler | Mar 12, 2026 | arXiv:2603.12064

Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

Shuo Sun, Unal Artan, Malcolm Mielle, Achim J. Lilienthal, and Martin Magnusson

AI Summary

This paper introduces a two-stage optimization framework for dense dynamic scene reconstruction and camera pose estimation from multi-view videos captured by freely moving cameras. The first stage extends visual SLAM to a multi-camera setting using a spatiotemporal connection graph and wide-baseline initialization for robust camera tracking. The second stage refines depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow, achieving state-of-the-art performance on both synthetic and real-world datasets while using less memory.

Key Contribution

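A two-stage framework for dense 3D reconstruction of dynamic scenes from multiple freely moving cameras, with no rigidly mounted, pre-calibrated rig required. It outperforms state-of-the-art feed-forward models on synthetic and real-world benchmarks while using less memory.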

Abstract

We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.
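To make the idea of a spatiotemporal connection graph more concrete, here is a minimal sketch, not the authors' implementation. It assumes nodes are (camera, frame) pairs, that intra-camera edges link temporally adjacent frames, and that inter-camera edges link two cameras at the same timestamp when a hypothetical `overlap` score exceeds a threshold. The function name, the `overlap` callable, `temporal_window`, and `overlap_threshold` are all illustrative assumptions that do not appear in the abstract.

```python
from itertools import combinations


def build_connection_graph(num_cameras, num_frames, overlap,
                           temporal_window=1, overlap_threshold=0.5):
    """Return a set of undirected edges between (camera, frame) nodes.

    `overlap` is a hypothetical callable (cam_a, cam_b, frame) -> float
    in [0, 1]; how overlap is actually measured is not specified here.
    """
    edges = set()

    # Intra-camera edges: temporal continuity within one video.
    for cam in range(num_cameras):
        for t in range(num_frames):
            for dt in range(1, temporal_window + 1):
                if t + dt < num_frames:
                    edges.add(((cam, t), (cam, t + dt)))

    # Inter-camera edges: spatial overlap between different cameras
    # (assumed here to be evaluated only at the same timestamp).
    for t in range(num_frames):
        for cam_a, cam_b in combinations(range(num_cameras), 2):
            if overlap(cam_a, cam_b, t) >= overlap_threshold:
                edges.add(((cam_a, t), (cam_b, t)))

    return edges


def toy_overlap(cam_a, cam_b, frame):
    # Toy score for illustration: cameras overlap only on even frames.
    return 0.8 if frame % 2 == 0 else 0.1


edges = build_connection_graph(num_cameras=2, num_frames=3, overlap=toy_overlap)
```

In this toy setting the inter-camera edges are what tie the two trajectories into one graph. That is one plausible reading of how shared observations could support a consistent scale across cameras.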

Computer Vision, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
