Flex4DHuman is a multi-view video diffusion model that transforms monocular or sparse multi-view videos into synchronized dense multi-view videos using relative camera-pose conditioning, without relying on explicit geometry priors. It uses a five-axis positional encoding and a three-stage curriculum that trains the model for pose following, flexible view generation, and temporal rollout. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, and the same formulation generalizes to animal categories after mixed human-animal training, making it a step toward scalable 4D content creation from standard video inputs.
Transforming casual monocular videos into dynamic 4D Gaussian splats without explicit geometry priors could make 4D content creation for simulation, gaming, and AR/VR far more accessible.
We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
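To illustrate what "relative camera-pose conditioning" refers to geometrically, the following is a minimal sketch of computing a relative SE(3) transform between a reference camera and a target camera. It is not the Flex4DHuman implementation: the abstract does not state the pose convention or how the relative geometry enters the positional encoding, so the camera-to-world convention and the helper names (`make_pose`, `invert_se3`, `relative_pose`, `rot_y`) are assumptions made for this example only.

```python
import numpy as np


def make_pose(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector into a 4x4 camera-to-world matrix (assumed convention)."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def invert_se3(pose):
    """Invert a rigid transform in closed form: R -> R^T, t -> -R^T t."""
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse


def relative_pose(ref_c2w, tgt_c2w):
    """Pose of the target camera expressed in the reference camera's frame."""
    return invert_se3(ref_c2w) @ tgt_c2w


def rot_y(angle_rad):
    """Rotation about the y-axis by the given angle in radians."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


# Hypothetical cameras: a reference view and a target view rotated 90 degrees.
ref = make_pose(rot_y(0.0), np.array([0.0, 0.0, 2.0]))
tgt = make_pose(rot_y(np.pi / 2), np.array([2.0, 0.0, 0.0]))
rel = relative_pose(ref, tgt)

# Moving the whole rig by an arbitrary rigid transform leaves the relative pose unchanged.
world_shift = make_pose(rot_y(0.7), np.array([5.0, -1.0, 3.0]))
rel_shifted = relative_pose(world_shift @ ref, world_shift @ tgt)
```

One plausible motivation for relative rather than absolute poses is that the relative transform does not depend on the choice of world frame, which is what the last two lines check.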