The paper introduces FOFPred, a language-conditioned optical flow forecasting model that combines a Vision-Language Model (VLM) with a diffusion architecture. The model is trained on web-scale human activity data, using targeted preprocessing to extract meaningful signal from noisy video-caption pairs. FOFPred performs strongly on both robotic manipulation and video generation tasks, highlighting the benefits of the unified architecture and of scalable learning from web data for predicting future motion.
A unified Vision-Language Model and Diffusion architecture unlocks surprisingly effective optical flow forecasting from noisy web data, enabling language-conditioned robot control and video generation.
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data is relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and diffusion architecture. This combination couples strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signal from these noisy video-caption pairs, we employ targeted data preprocessing together with the unified architecture's strong image pretraining. The trained model is then extended to two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation in language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and of scalable learning from diverse web data for future optical flow prediction.
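The abstract does not give implementation details, but a minimal PyTorch sketch can illustrate the general shape of such a unified VLM-plus-diffusion flow forecaster: a toy multimodal encoder fuses image-patch features with caption tokens, and a diffusion head denoises a future optical-flow field conditioned on that context. All names here (`VLMEncoder`, `FlowDiffusionHead`, `diffusion_loss`), the tensor sizes, and the DDPM-style noise schedule are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class VLMEncoder(nn.Module):
    """Toy stand-in for a pretrained VLM: fuses image-patch features with caption tokens."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)   # hypothetical ViT patch features -> model dim
        self.tok_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patch_feats, caption_ids):
        # Concatenate visual and language tokens into one multimodal context sequence.
        tokens = torch.cat([self.patch_proj(patch_feats), self.tok_emb(caption_ids)], dim=1)
        return self.encoder(tokens)             # (B, n_patches + n_tokens, dim)

class FlowDiffusionHead(nn.Module):
    """Denoises a 2-channel future-flow field, conditioned on the multimodal context."""
    def __init__(self, dim=256):
        super().__init__()
        self.in_proj = nn.Conv2d(2, dim, 3, padding=1)
        self.t_emb = nn.Linear(1, dim)          # timestep embedding (illustrative)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Conv2d(dim, 2, 3, padding=1)

    def forward(self, noisy_flow, t, context):
        B, _, H, W = noisy_flow.shape
        h = self.in_proj(noisy_flow) + self.t_emb(t.view(B, 1))[:, :, None, None]
        q = h.flatten(2).transpose(1, 2)        # flow pixels as (B, H*W, dim) queries
        h, _ = self.cross_attn(q, context, context)   # attend over image+language tokens
        return self.out_proj(h.transpose(1, 2).reshape(B, -1, H, W))  # predicted noise

def diffusion_loss(encoder, head, patch_feats, caption_ids, future_flow, alphas_cumprod):
    """One DDPM-style training step: noise the target flow, predict the noise back."""
    B = future_flow.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(future_flow)
    noisy = a.sqrt() * future_flow + (1 - a).sqrt() * noise
    context = encoder(patch_feats, caption_ids)
    pred = head(noisy, t.float() / len(alphas_cumprod), context)
    return nn.functional.mse_loss(pred, noise)

# Shapes-only smoke test: 14x14 = 196 patches, 8-token caption, 32x32 flow grid.
betas = torch.linspace(1e-4, 0.02, 100)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
loss = diffusion_loss(VLMEncoder(), FlowDiffusionHead(),
                      torch.randn(2, 196, 768), torch.randint(0, 1000, (2, 8)),
                      torch.randn(2, 2, 32, 32), alphas_cumprod)
```

Cross-attention from flow pixels to the fused image-and-language tokens is one common way to inject multimodal conditioning into a dense generative head; FOFPred's actual conditioning mechanism, backbone, and training objective may differ.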