This paper addresses long-term 3D reconstruction under substantial appearance change by proposing a joint Structure-from-Motion (SfM) pipeline that directly enforces cross-session correspondences. The key insight is that post-hoc alignment of independently reconstructed sessions fails under large temporal appearance change, so the sessions must be reconstructed jointly. The method combines handcrafted and learned visual features to establish robust cross-session correspondences, and uses visual place recognition to limit learned feature matching to likely cross-session image pairs, which improves scalability and robustness. On long-term coral reef datasets, it produces coherent joint reconstructions where existing methods fail.
Achieve coherent 3D reconstruction across years of visual change by jointly optimizing SfM with learned and handcrafted features, even when standard pipelines crumble.
Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.
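The pair-selection step described at the end of the abstract can be illustrated with a minimal sketch. It assumes each image already has a global descriptor from some visual place recognition model, and it keeps only the top-k most similar images from the other session as candidates for expensive learned matching. The function names, the cosine-similarity ranking, the `top_k` and `min_similarity` parameters, and the toy descriptors are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Descriptor = Sequence[float]


def cosine_similarity(a: Descriptor, b: Descriptor) -> float:
    """Cosine similarity between two global image descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_cross_session_pairs(
    session_a: Dict[str, Descriptor],
    session_b: Dict[str, Descriptor],
    top_k: int = 2,
    min_similarity: float = 0.0,
) -> List[Tuple[str, str, float]]:
    """For each image in session A, keep its top_k most similar images in session B.

    Hypothetical helper: only the returned pairs would be passed on to the
    expensive learned feature matcher, instead of all |A| x |B| pairs.
    """
    pairs: List[Tuple[str, str, float]] = []
    for name_a, desc_a in session_a.items():
        scored = [
            (cosine_similarity(desc_a, desc_b), name_b)
            for name_b, desc_b in session_b.items()
        ]
        # Highest similarity first; ties broken by image name for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        for score, name_b in scored[:top_k]:
            if score >= min_similarity:
                pairs.append((name_a, name_b, score))
    return pairs


# Toy descriptors (assumed values) standing in for real place recognition output.
session_old = {"a0.jpg": [1.0, 0.0, 0.0], "a1.jpg": [0.0, 1.0, 0.0]}
session_new = {
    "b0.jpg": [0.9, 0.1, 0.0],
    "b1.jpg": [0.0, 0.2, 0.98],
    "b2.jpg": [0.1, 0.95, 0.0],
}

candidate_pairs = select_cross_session_pairs(
    session_old, session_new, top_k=1, min_similarity=0.5
)
for name_a, name_b, score in candidate_pairs:
    print(f"{name_a} <-> {name_b}  similarity={score:.3f}")
```

With `top_k` fixed, the number of pairs sent to the learned matcher grows linearly with the size of one session rather than with the product of both session sizes, which is the scalability argument the abstract makes. The similarity threshold is one possible way to drop unlikely pairs before matching.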