NTU · May 28, 2026 · arXiv:2605.29766

MARS Policy: Multimodality Only When It Matters

Jindou Jia, Tuo An, Gen Li, Jingliang Li, Bohan Hou, Xiangyu Chen, Jiaqiang Bai, Bofan Lyu, Jianfei Yang

AI Summary

The paper introduces the Modality-Adaptive Robot Sampling (MARS) policy, an imitation learning approach for robotic manipulation that activates multimodal generation only in task phases where behavioral diversity is beneficial. MARS bridges the gap between multimodal generative policies and efficient deterministic models by injecting noise only when, and only as much as, it is needed. Experiments across 8 simulated and 4 real-world tasks show that, in real-world tests, MARS improves success rate by 16.67% and reduces inference latency by 83.20% compared to existing methods.

Key Contribution

Robots can learn faster and perform better by strategically choosing *when* to be unpredictable, achieving both multimodal expressivity and deterministic efficiency.

Abstract

Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. In particular, multimodality, which enables robots to capture diverse yet valid behavioral patterns, has driven the rapid emergence of generative policies as a dominant paradigm in robot learning. However, achieving such multimodality typically relies on stochastic noise initialization and iterative denoising procedures, resulting in substantial training complexity and low inference efficiency. Meanwhile, not all phases of a robotic task inherently require behavioral diversity. Motivated by this insight, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial, while reverting to efficient deterministic learning during single-modal phases. In other words, the proper amount of noise is injected only at the proper time. By selectively activating multimodal generation, the MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies across 8 simulated and 4 real-world tasks demonstrate that MARS exhibits robust multimodal expressivity and high efficiency, with a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests. Counterintuitively, MARS also outpaces deterministic policies in training efficiency on near-deterministic tasks by more effectively modeling nuanced action diversity.
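
The idea of injecting "the proper amount of noise only at the proper time" can be illustrated with a toy sketch. The abstract does not describe how MARS decides when noise is needed or how actions are decoded, so everything below is an assumption made for illustration: the function names (`adaptive_sample`, `toy_noise_scale`, `toy_decode`), the threshold, and the one-dimensional obstacle-avoidance setup are hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def adaptive_sample(
    obs: Vector,
    noise_scale_fn: Callable[[Vector], float],
    decode_fn: Callable[[Vector, Vector], List[float]],
    latent_dim: int,
    rng: random.Random,
    threshold: float = 1e-3,
) -> Tuple[List[float], bool]:
    """Return (action, used_noise) for one observation.

    Hypothetical gating rule: sample a Gaussian latent only when the
    predicted noise scale exceeds `threshold`; otherwise use a zero
    latent, which makes the output deterministic.
    """
    sigma = noise_scale_fn(obs)
    if sigma < threshold:
        # Single-modal phase: no sampling, one deterministic decode.
        latent = [0.0] * latent_dim
        return decode_fn(obs, latent), False
    # Multimodal phase: draw a latent with the predicted scale.
    latent = [rng.gauss(0.0, sigma) for _ in range(latent_dim)]
    return decode_fn(obs, latent), True


def toy_noise_scale(obs: Vector) -> float:
    # Assumed toy rule: an obstacle dead ahead (lateral offset near 0)
    # can be passed on either side, so diversity is useful there.
    return 1.0 if abs(obs[0]) < 0.1 else 0.0


def toy_decode(obs: Vector, latent: Vector) -> List[float]:
    # Steer away from the obstacle; when it is dead ahead, the sign of
    # the latent picks one of the two valid directions.
    offset = obs[0]
    if abs(offset) >= 0.1:
        return [-1.0 if offset > 0 else 1.0]
    return [1.0 if latent[0] >= 0 else -1.0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Obstacle clearly to the right: deterministic, always steer left.
    print(adaptive_sample([0.5], toy_noise_scale, toy_decode, 1, rng))
    # Obstacle dead ahead: repeated calls can yield either direction.
    print({adaptive_sample([0.0], toy_noise_scale, toy_decode, 1, rng)[0][0]
           for _ in range(20)})
```

In this sketch the deterministic branch never touches the random number generator, which mirrors (in a very simplified way) the efficiency argument of the abstract: stochastic sampling is paid for only in the phases that have more than one valid behavior.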

Multimodal Models · Robotics & Embodied AI · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
