AI Lab | SJTU | Apr 21, 2026 | arXiv:2604.19747

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Yutian Chen, Shida Guo, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Ming-Hsuan Yang, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue

AI Summary

AnyRecon is a diffusion-based 3D reconstruction framework that uses a persistent global scene memory and geometry-aware conditioning to improve scalability and geometric consistency for arbitrary, unordered sparse inputs. By prepending a capture view cache and maintaining frame-level correspondence, AnyRecon avoids the limitations of the single- or dual-frame conditioning common in existing diffusion methods. The method also employs 4-step diffusion distillation and context-window sparse attention for efficiency, enabling robust reconstruction across diverse scenes and viewpoints.

Key Contribution

AnyRecon extends diffusion-based 3D reconstruction to arbitrary, unordered sparse inputs, with robust and scalable performance reported across irregular viewpoints and long trajectories.

Abstract

Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
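
The abstract does not specify the attention pattern, so the following is only an illustrative sketch of how a prepended capture view cache could be combined with context-window sparse attention. It assumes a frame-level mask in which every frame may attend to all cached capture views, while generated frames additionally attend only to generated frames within a fixed window. The function names and the parameters `num_cache`, `num_gen`, and `window` are hypothetical and are not taken from the paper.

```python
def build_attention_mask(num_cache, num_gen, window):
    """Frame-level boolean mask; mask[q][k] is True if frame q may attend to frame k.

    Assumed layout (hypothetical, not from the paper):
    frames 0..num_cache-1 are the prepended capture views (the cache),
    frames num_cache..num_cache+num_gen-1 are generated frames.
    """
    total = num_cache + num_gen
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_cache:
                # Every frame can read the global cache of capture views.
                mask[q][k] = True
            elif q >= num_cache and abs(q - k) <= window:
                # Generated frames see only nearby generated frames.
                mask[q][k] = True
    return mask


def count_attended(mask):
    """Number of (query, key) frame pairs that are allowed to attend."""
    return sum(sum(row) for row in mask)


mask = build_attention_mask(num_cache=4, num_gen=16, window=2)
dense = len(mask) ** 2
sparse = count_attended(mask)
print(f"attended frame pairs: {sparse} of {dense}")
```

Under these assumptions, for a fixed cache size and window the number of attended pairs grows linearly with the number of generated frames rather than quadratically, which is the kind of saving the abstract attributes to sparse attention.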

Computer Vision | Multimodal Models | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 30

Year: 2026

Venue: N/A
