This paper introduces π-StepNFT, a novel online reinforcement learning framework for flow-based vision-language-action models that avoids intractable likelihoods and auxiliary value networks. The key insight is that wider exploration spaces in VLAs require finer-grained, step-wise guidance for effective alignment during multi-step sampling. Experiments on LIBERO and ManiSkill demonstrate that π-StepNFT achieves competitive few-shot robustness and superior generalization, particularly in out-of-distribution scenarios, compared to value-based methods.
Flow-based VLAs can now learn online without likelihoods or value networks, unlocking better generalization in complex embodied control tasks.
Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, which hinders online reinforcement learning. We propose π-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic- and likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, π-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in out-of-distribution (OOD) scenarios by preventing overfitting to multimodal features. This property makes it a promising, scalable approach for complex real-world applications.
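As background on why multi-step sampling makes likelihoods hard to use, here is a sketch in standard continuous-normalizing-flow notation. The symbols $v_\theta$ (learned velocity field), $a_t$ (action at flow time $t$), $o$ (observation and instruction context), and $K$ (number of sampling steps) are assumptions for illustration, not notation taken from the paper. Sampling integrates the velocity field from noise to an action, for example with Euler steps:

\[
a_{t_0} \sim \mathcal{N}(0, I), \qquad
a_{t_{k+1}} = a_{t_k} + (t_{k+1} - t_k)\, v_\theta(a_{t_k}, t_k, o), \quad k = 0, \dots, K-1, \quad t_0 = 0,\; t_K = 1 .
\]

For the underlying continuous-time flow, the exact log-likelihood follows from the instantaneous change-of-variables formula:

\[
\log p_\theta(a_1 \mid o) = \log p_0(a_0) - \int_0^1 \nabla_{a} \cdot v_\theta(a_t, t, o)\, dt .
\]

Evaluating the divergence term means tracing the Jacobian of $v_\theta$ along the whole sampling trajectory. That cost is one common reading of "intractable likelihoods" here, and it is why likelihood-ratio policy-gradient updates fit flow-based policies poorly.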