HuggingFace | UCSD | World Labs | Jun 11, 2026 | arXiv:2606.13655

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

AI Summary

Flex4DHuman is a multi-view video diffusion model that turns a monocular or sparse multi-view video into synchronized dense multi-view videos using only relative camera-pose conditioning, with no explicit geometry priors. It encodes camera and view information through a five-axis positional encoding and is trained with a three-stage curriculum covering pose following, flexible reference-to-target view generation, and temporal rollout. On DNA-Rendering and ActorsHQ, Flex4DHuman surpasses prior state-of-the-art methods, and the same formulation generalizes to animal categories after mixed human-animal training.

Key Contribution

Flex4DHuman generates dense multi-view videos from casual monocular videos without explicit geometry priors; combined with an off-the-shelf 4D Gaussian Splatting stage, this yields dynamic 4D Gaussian splats for content creation in simulation, gaming, and AR/VR.

Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
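The abstract states that conditioning uses relative rather than absolute camera poses. As a minimal sketch of what a relative SE(3) pose is, the snippet below computes the pose of a target camera expressed in a reference camera's frame from two camera-to-world matrices. The function names (`relative_pose`, `make_c2w`), the camera-to-world convention, and the example camera placements are assumptions made for illustration; the listing does not specify how Flex4DHuman parameterizes or encodes this quantity inside its positional encoding.

```python
import numpy as np

def make_c2w(yaw_deg, position):
    """Build a 4x4 camera-to-world matrix (hypothetical helper).

    The camera is rotated by yaw_deg about the world y-axis and placed
    at `position`. Real extrinsics would come from dataset calibration.
    """
    a = np.deg2rad(yaw_deg)
    c, s = np.cos(a), np.sin(a)
    c2w = np.eye(4)
    c2w[:3, :3] = np.array([[c, 0.0, s],
                            [0.0, 1.0, 0.0],
                            [-s, 0.0, c]])
    c2w[:3, 3] = np.asarray(position, dtype=float)
    return c2w

def relative_pose(ref_c2w, tgt_c2w):
    """Pose of the target camera in the reference camera's frame.

    Assumes both inputs are camera-to-world SE(3) matrices, so the
    result is inv(ref) @ tgt and does not depend on the world frame.
    """
    return np.linalg.inv(ref_c2w) @ tgt_c2w

# Two illustrative cameras looking at a subject from different sides.
ref = make_c2w(0.0, [0.0, 0.0, 2.0])
tgt = make_c2w(90.0, [2.0, 0.0, 0.0])
rel = relative_pose(ref, tgt)

# Moving the whole rig by a global transform leaves the relative pose unchanged.
world_shift = make_c2w(37.0, [1.0, -2.0, 3.0])
rel_shifted = relative_pose(world_shift @ ref, world_shift @ tgt)
```

Because the relative pose is unchanged when both cameras are moved by the same global transform, a model conditioned on it does not need a canonical world frame, which is one plausible reason to prefer relative over absolute poses.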

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
