The paper introduces OFlow, a framework that unifies temporal foresight and object-aware reasoning in a shared semantic latent space for robotic manipulation. OFlow forecasts future latents using temporal flow matching and factorizes them into object-aware representations. Integrating OFlow into Vision-Language-Action (VLA) pipelines improves control reliability under distribution shifts, as demonstrated across multiple benchmarks and real-world tasks.
Robots get a crucial boost in robustness by learning to "see" and predict how objects will move, not just react to the current frame.
Robust robotic manipulation requires not only predicting how the scene evolves over time, but also recognizing task-relevant objects in complex scenes. However, existing VLA models face two limitations: they typically act only on the current frame, and they often learn future prediction and object-aware reasoning in separate latent spaces. We propose OFlow (injecting Object-Aware Temporal Flow Matching into VLAs), a framework that addresses both limitations by unifying temporal foresight and object-aware reasoning in a shared semantic latent space. Our method forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering task-irrelevant variation, and conditions continuous action generation on these predictions. By integrating OFlow into VLA pipelines, our method enables more reliable control under distribution shifts. Extensive experiments across the LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks, as well as real-world tasks, demonstrate that object-aware foresight consistently enhances robustness and success.
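To make the idea of forecasting future latents with flow matching more concrete, here is a minimal NumPy sketch of generic conditional flow matching. It is not OFlow's implementation. It assumes a straight-line probability path between a noise sample and the future latent, a squared-error velocity loss, and a fixed-step Euler solver at inference time, none of which are stated in the abstract. All names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_forecast`, `velocity_fn`, `cond`) are hypothetical, and the object-aware factorization step is not modeled.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)  # one time value per batch row
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t (assumed path choice)."""
    return x1 - x0

def flow_matching_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity at x_t.

    x0:   noise sample, shape (batch, dim)
    x1:   future latent to be forecast, shape (batch, dim)
    cond: conditioning input, e.g. the current-frame latent (assumption)
    """
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))

def euler_forecast(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full(x.shape[0], k * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x

# Toy check with an oracle velocity field standing in for a trained network.
rng = np.random.default_rng(0)
current_latent = rng.standard_normal((4, 8))  # stand-in for the current-frame latent
future_latent = rng.standard_normal((4, 8))   # stand-in for the future latent
noise = rng.standard_normal((4, 8))

def oracle(x, t, cond):
    # A perfect model for the straight-line path would output x1 - x0.
    return future_latent - noise

loss = flow_matching_loss(oracle, noise, future_latent, rng.uniform(size=4), current_latent)
forecast = euler_forecast(oracle, noise, current_latent, steps=10)
```

In a real pipeline `velocity_fn` would be a learned network conditioned on the current observation and instruction, and the forecast latent would feed the action generator. The oracle is used here only to show that integrating the correct velocity field carries the noise sample onto the future latent.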