This paper examines whether autoregressive policies can support real-time execution in Vision-Language-Action models, a setting in which prior work has focused mainly on diffusion-policy variants. By adjusting the tokenization horizon and applying constrained decoding, the authors guarantee strict latency bounds, which in turn make multi-trajectory decoding possible. In simulated and real-world environments, the autoregressive policy outperforms its equivalent-level flow-matching counterpart and completes tasks significantly faster than it does under synchronous inference, supporting its viability for real-time applications.
Autoregressive policies can achieve real-time execution with superior performance and speed, challenging the dominance of diffusion-based approaches.
Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution focuses primarily on variants of diffusion policies, even though real-time execution matters more for autoregressive policies given their slower rollout speed under synchronous inference. We demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching counterpart while completing tasks significantly faster than under synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalization in instruction following, these results confirm that autoregressive policies remain a competitive policy type that supports real-time execution.
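The interplay between a fixed token budget, constrained decoding, and multi-trajectory decoding can be illustrated with a toy sketch. This is not the authors' implementation: `toy_logits` stands in for a policy forward pass, the vocabulary layout (`VOCAB_SIZE`, `ACTION_TOKENS`, `TOKENS_PER_STEP`) is hypothetical, and selecting the candidate with the highest log-likelihood is an assumption, since the abstract does not state how trajectories are chosen. The point is only that restricting sampling to action tokens and fixing the horizon makes the number of decoding steps, and hence the latency, known in advance.

```python
import math
import random

VOCAB_SIZE = 16               # hypothetical toy vocabulary size
ACTION_TOKENS = range(4, 12)  # hypothetical ids reserved for action bins
TOKENS_PER_STEP = 2           # hypothetical tokens encoding one action step


def toy_logits(prefix):
    """Stand-in for a policy forward pass; deterministic given the prefix."""
    seed = 7919 * len(prefix) + sum((i + 1) * t for i, t in enumerate(prefix))
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(VOCAB_SIZE)]


def constrained_sample(logits, allowed, rng):
    """Sample one token from the softmax restricted to the `allowed` ids."""
    ids = list(allowed)
    m = max(logits[i] for i in ids)
    weights = [math.exp(logits[i] - m) for i in ids]
    token = rng.choices(ids, weights=weights, k=1)[0]
    # Log-probability under the renormalized (constrained) distribution.
    logprob = logits[token] - m - math.log(sum(weights))
    return token, logprob


def decode_chunk(horizon, num_trajectories, seed=0):
    """Decode candidate action chunks and keep the most likely one.

    Every candidate uses exactly horizon * TOKENS_PER_STEP decoding steps,
    so the step count is fixed before decoding starts. A real system would
    decode the candidates as one batch; the loop here is for clarity only.
    """
    rng = random.Random(seed)
    budget = horizon * TOKENS_PER_STEP
    best_tokens, best_score = None, -math.inf
    for _ in range(num_trajectories):
        tokens, score = [], 0.0
        for _ in range(budget):
            tok, lp = constrained_sample(toy_logits(tokens), ACTION_TOKENS, rng)
            tokens.append(tok)
            score += lp
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens, best_score


if __name__ == "__main__":
    tokens, score = decode_chunk(horizon=4, num_trajectories=4)
    print(len(tokens), tokens, round(score, 3))
```

In this sketch, shortening `horizon` reduces the number of decoding steps per chunk, while raising `num_trajectories` widens the candidate set without changing the per-candidate step count.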