This paper tackles the scarcity of 4D datasets for 4D content generation by transferring spatial priors from 3D diffusion models and temporal priors from video diffusion models. The authors introduce a Spatial-Temporal-Disentangled 4D (STD-4D) Diffusion model and an Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism to inject these priors. The method also uses a spatial-temporal-aware HexPlane (ST-HexPlane) to improve 4D deformation and Gaussian feature modeling, which yields better spatial-temporal consistency and quality in 4D synthesis.
Overcomes the scarcity of 4D training data by borrowing spatial understanding from 3D models and temporal dynamics from video models.
In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
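For readers unfamiliar with the HexPlane representation that ST-HexPlane builds on, the following is a minimal sketch of a generic HexPlane feature lookup, not the paper's ST-HexPlane. It assumes coordinates normalized to [0, 1], square planes of a single resolution, and fusion by element-wise product (other variants concatenate or sum plane features). All names here (`make_hexplane`, `bilinear_sample`, `query_hexplane`) are hypothetical and chosen for illustration only.

```python
import numpy as np

# The six axis pairs of a HexPlane over (x, y, z, t):
# three spatial planes (xy, xz, yz) and three spatial-temporal planes (xt, yt, zt).
PLANE_AXES = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]


def make_hexplane(resolution, channels, seed=0):
    """Create six random feature planes, each (resolution, resolution, channels)."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((resolution, resolution, channels))
            for _ in PLANE_AXES]


def bilinear_sample(plane, u, v):
    """Bilinearly interpolate `plane` at coordinates u, v in [0, 1]."""
    res = plane.shape[0]  # assumes res >= 2
    # Map [0, 1] onto the grid index range [0, res - 1].
    x = np.clip(u, 0.0, 1.0) * (res - 1)
    y = np.clip(v, 0.0, 1.0) * (res - 1)
    # Lower cell corner, capped so that the upper corner stays in bounds.
    x0 = np.minimum(np.floor(x).astype(int), res - 2)
    y0 = np.minimum(np.floor(y).astype(int), res - 2)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    low = plane[x0, y0] * (1 - wx) + plane[x0 + 1, y0] * wx
    high = plane[x0, y0 + 1] * (1 - wx) + plane[x0 + 1, y0 + 1] * wx
    return low * (1 - wy) + high * wy


def query_hexplane(planes, points):
    """Return one fused feature vector per (x, y, z, t) point, shape (N, channels)."""
    points = np.asarray(points, dtype=float)
    features = np.ones((points.shape[0], planes[0].shape[-1]))
    for plane, (a, b) in zip(planes, PLANE_AXES):
        # Assumed fusion rule: element-wise product across all six planes.
        features = features * bilinear_sample(plane, points[:, a], points[:, b])
    return features
```

In a 4D Gaussian pipeline, a feature vector of this kind would typically be passed to a small decoder that predicts per-Gaussian deformation at time t; how the transferred spatial and temporal priors enter ST-HexPlane is not specified in the abstract, so it is left out of this sketch.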