This paper introduces ManiSplat, a framework that reconstructs dynamic 3D scenes from monocular robotic videos. It uses a Graph-Structured Disentangled Representation to optimize separate Gaussian subfields for the robot, objects, and background. To handle contact interactions and abrupt pose changes, a Task-Oriented Spatio-Temporal Alignment module constructs accurate pseudo-ground-truth trajectories from the structure of manipulation tasks. Experiments show that ManiSplat reconstructs scenes with high fidelity and controllability and supports downstream robotic tasks and policy learning.
ManiSplat reconstructs high-fidelity, controllable 3D scenes from monocular ego-view robot video, producing digital twins suited to simulation and policy learning.
Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks, which alternate between Motion and Skill phases, to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.
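As a rough illustration of what a scene graph of decoupled Gaussian subfields could look like, the sketch below stores each entity's Gaussian centers in its own local frame and composes poses up the graph to get world coordinates. This is not the paper's implementation: the names `Pose`, `SubfieldNode`, and `world_centers` are hypothetical, poses are simplified to yaw plus translation, and only Gaussian means are modeled (no covariance, opacity, or color).

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Pose:
    """Simplified rigid transform (assumption): yaw about z, then translation."""
    yaw: float = 0.0
    t: Vec3 = (0.0, 0.0, 0.0)

    def apply(self, p: Vec3) -> Vec3:
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        x, y, z = p
        return (c * x - s * y + self.t[0],
                s * x + c * y + self.t[1],
                z + self.t[2])


@dataclass
class SubfieldNode:
    """Hypothetical scene-graph node holding one entity's Gaussian centers."""
    name: str
    centers: List[Vec3]                       # means in the node's local frame
    pose: Pose = field(default_factory=Pose)  # pose relative to the parent node
    parent: Optional["SubfieldNode"] = None

    def world_centers(self) -> List[Vec3]:
        # Apply this node's pose first, then each ancestor's pose in turn.
        pts = self.centers
        node: Optional[SubfieldNode] = self
        while node is not None:
            pts = [node.pose.apply(p) for p in pts]
            node = node.parent
        return pts


# Toy scene: the background is the root; robot and cup are separate subfields.
background = SubfieldNode("background", [(0.0, 0.0, 0.0)])
robot = SubfieldNode("robot", [(0.0, 0.0, 0.5)], parent=background)
cup = SubfieldNode("cup", [(0.1, 0.0, 0.0)],
                   pose=Pose(t=(1.0, 0.0, 0.0)), parent=background)

robot_before = robot.world_centers()
cup.pose = Pose(yaw=math.pi / 2, t=(1.0, 2.0, 0.0))  # re-pose only the cup
cup_after = cup.world_centers()
robot_after = robot.world_centers()
```

Because each subfield keeps its own centers and pose, re-posing the cup leaves the robot's and background's Gaussians untouched. That independence is the property the abstract refers to as decoupled and controllable.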