Representations of future motion, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data is relatively unexplored.
We introduce FOFPred, a novel language-conditioned optical flow forecasting model built on a unified Vision-Language Model (VLM) and Diffusion architecture. This combination couples strong multimodal reasoning with pixel-level generative fidelity for future motion prediction.
Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signal from this noisy video-caption data, we rely on careful data preprocessing together with our unified architecture and its strong image pretraining.
The trained model is then extended to two downstream tasks in control and generation. Evaluations on language-driven robotic manipulation and video generation establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and of scalable learning from diverse web data for future optical flow prediction.
Leveraging a novel architecture that combines Vision-Language Models with diffusion for optical flow generation.
First approach to leverage natural language for guiding future optical flow prediction in diverse scenarios.
Trained on large-scale human activity data with effective preprocessing for noisy real-world video-caption pairs.
Our unified VLM-Diffusion architecture combines strong multimodal reasoning capabilities with pixel-level generative fidelity for language-conditioned future optical flow prediction.
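To make this idea concrete, here is a minimal, self-contained PyTorch sketch of the general recipe: a VLM-style encoder turns the first frame and the language prompt into conditioning tokens, and a diffusion denoiser predicts the noise added to a two-channel future flow field while cross-attending to those tokens. All module names, dimensions, and the simple interpolation noise schedule below are illustrative assumptions, not the FOFPred implementation; please refer to the paper for the actual architecture.

# A minimal sketch of a language-conditioned flow-diffusion training step.
# All module names, sizes, and the linear-interpolation noise schedule are
# illustrative assumptions and NOT the FOFPred implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLMEncoder(nn.Module):
    """Stand-in for a VLM: maps the first frame and a tokenized prompt to conditioning tokens."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.img_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify the frame
        self.txt_emb = nn.Embedding(vocab, dim)

    def forward(self, image, prompt_ids):
        img_tokens = self.img_proj(image).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt_tokens = self.txt_emb(prompt_ids)                         # (B, N_txt, dim)
        return torch.cat([img_tokens, txt_tokens], dim=1)             # joint multimodal tokens

class ToyFlowDenoiser(nn.Module):
    """Predicts the noise added to a 2-channel future flow field, cross-attending to VLM tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.in_conv = nn.Conv2d(2, dim, 3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out_conv = nn.Conv2d(dim, 2, 3, padding=1)

    def forward(self, noisy_flow, t, cond_tokens):
        h = self.in_conv(noisy_flow)                        # (B, dim, H, W)
        B, C, H, W = h.shape
        q = h.flatten(2).transpose(1, 2) + t.view(B, 1, 1)  # flow pixels as queries; crude timestep cond.
        attn_out, _ = self.attn(q, cond_tokens, cond_tokens)
        h = (q + attn_out).transpose(1, 2).reshape(B, C, H, W)
        return self.out_conv(h)                             # predicted noise

# One illustrative noise-prediction training step.
encoder, denoiser = ToyVLMEncoder(), ToyFlowDenoiser()
image = torch.randn(1, 3, 64, 64)            # first frame
prompt_ids = torch.randint(0, 1000, (1, 8))  # tokenized caption
future_flow = torch.randn(1, 2, 64, 64)      # ground-truth future flow (dx, dy)

t = torch.rand(1)                            # diffusion time in [0, 1]
noise = torch.randn_like(future_flow)
noisy_flow = (1 - t).view(1, 1, 1, 1) * future_flow + t.view(1, 1, 1, 1) * noise

pred_noise = denoiser(noisy_flow, t, encoder(image, prompt_ids))
loss = F.mse_loss(pred_noise, noise)
loss.backward()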
We first visualize optical flow predictions of our base model. Then we showcase our model's performance on the two downstream tasks: robotic manipulation and video generation.
We visualize optical flow predictions of our base model for a randomly selected set of image-prompt pairs, highlighting both the strengths and limitations of our method.
"Moving the water bottle from right to left."
"Lifting the water bottle up."
FOFPred enables controllable video generation through predicted optical flow. Both the baseline and our model generate videos from the same first-frame image and prompt. For our model, we visualize both the predicted optical flow and the final generated video.
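To build intuition for why dense future flow is a useful control signal here, the sketch below backward-warps the first frame with a flow field using torch.nn.functional.grid_sample. This is only an illustration of flow as a motion prior; how the predicted flow actually conditions the video generator in FOFPred is described in the paper and may differ.

# Illustrative only: backward-warp a frame with a dense flow field via grid_sample.
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a frame (B, 3, H, W) with a flow field (B, 2, H, W) given in pixels."""
    B, _, H, W = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow                                      # shifted sampling locations
    # Normalize to [-1, 1] as expected by grid_sample (x by W, y by H).
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

frame = torch.rand(1, 3, 64, 64)                     # first frame
flow = torch.zeros(1, 2, 64, 64); flow[:, 0] = 3.0   # sample 3 px to the right; content shifts left
next_frame = warp_with_flow(frame, flow)
print(next_frame.shape)                              # torch.Size([1, 3, 64, 64])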
We show examples from the RoboTwin environment, where a fine-tuned version of our FOFPred model performs language-driven robotic manipulation tasks. For further details on implementation and experiments, please refer to our code and paper.
@article{Ran26FOFPred,
title = {Language Driven Future Optical Flow Prediction},
author = {Ranasinghe, Kanchana and Zhou, Honglu and Fang, Yu and
Yang, Luyu and Xue, Le and Xu, Ran and Xiong, Caiming and
Savarese, Silvio and Ryoo, Michael S. and Niebles, Juan Carlos},
journal = {arXiv},
year = {2026}
}