The paper introduces DeMUSE, a Deep Multimodal Unified Sparse Experts framework that uses a Diffusion Transformer to fuse RGB, depth, and 6-axis force data into a unified representation for dexterous embodied manipulation. The authors address representation imbalance across modalities with Adaptive Modality-specific Normalization (AdaMN) and scale the model efficiently with a Sparse Mixture-of-Experts (MoE) architecture. DeMUSE achieves state-of-the-art performance in both simulation (83.2% success rate) and real-world experiments (72.5% success rate), demonstrating the effectiveness of deep multi-sensory fusion for complex manipulation tasks.
Integrating force feedback with vision enables a robot to achieve 72.5% success in real-world dexterous manipulation tasks, outperforming vision-only approaches.
Realizing dexterous embodied manipulation necessitates the deep integration of heterogeneous multimodal sensory inputs. However, current vision-centric paradigms often overlook the critical force and geometric feedback essential for complex tasks. This paper presents DeMUSE, a Deep Multimodal Unified Sparse Experts framework leveraging a Diffusion Transformer to integrate RGB, depth, and 6-axis force data into a unified serialized stream. Adaptive Modality-specific Normalization (AdaMN) is employed to recalibrate modality-aware features, mitigating representation imbalance and harmonizing the heterogeneous distributions of multi-sensory signals. To facilitate efficient scaling, the architecture utilizes a Sparse Mixture-of-Experts (MoE) with shared experts, increasing model capacity for physical priors while maintaining the low inference latency required for real-time control. A joint denoising objective synchronously synthesizes environmental evolution and action sequences to ensure physical consistency. Achieving success rates of 83.2% and 72.5% in simulation and real-world trials, respectively, DeMUSE demonstrates state-of-the-art performance, validating the necessity of deep multi-sensory integration for complex physical interactions.
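The abstract does not spell out how the Sparse MoE with shared experts is wired, so the following is only a minimal sketch of the general pattern, not the DeMUSE implementation. It assumes linear experts, softmax gating over the top-k routed experts, and one shared expert applied to every token. The class name `SparseMoE`, the sizes (`dim=8`, four routed experts, `top_k=2`), and the random weights are all hypothetical, chosen for illustration.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free dense layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class SparseMoE:
    """Hypothetical MoE layer: shared experts plus top-k routed experts."""

    def __init__(self, dim, n_routed=4, n_shared=1, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = make_matrix(n_routed, dim, rng)  # one score row per routed expert
        self.routed = [make_matrix(dim, dim, rng) for _ in range(n_routed)]
        self.shared = [make_matrix(dim, dim, rng) for _ in range(n_shared)]

    def route(self, token):
        """Return (expert_index, gate_weight) pairs for the top-k routed experts."""
        scores = linear(self.router, token)
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top = top[: self.top_k]
        gates = softmax([scores[i] for i in top])  # renormalize over the selected experts
        return list(zip(top, gates))

    def forward(self, token):
        out = [0.0] * len(token)
        # Shared experts see every token, regardless of routing.
        for weights in self.shared:
            out = [o + y for o, y in zip(out, linear(weights, token))]
        # Only the top-k routed experts are evaluated, which keeps compute sparse.
        for idx, gate in self.route(token):
            y = linear(self.routed[idx], token)
            out = [o + gate * v for o, v in zip(out, y)]
        return out


moe = SparseMoE(dim=8)
token = [0.5] * 8  # stand-in for one fused multimodal token
selected = moe.route(token)
output = moe.forward(token)
```

Because only `top_k` of the routed experts run per token, adding routed experts grows parameter count without a matching increase in per-token compute, which is the property the abstract relies on for low inference latency.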