Representations of future motion, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data is relatively unexplored.
We introduce FOFPred, a novel language-conditioned optical flow forecasting model built on a unified Vision-Language Model (VLM) and Diffusion architecture. This combination couples strong multimodal reasoning with pixel-level generative fidelity for future motion prediction.
Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signal from this noisy video-caption data, we rely on careful data preprocessing together with our unified architecture and its strong image pretraining.
The trained model is then extended to two downstream tasks in control and generation. Evaluations on language-driven robotic manipulation and video generation establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and of scalable learning from diverse web data for future optical flow prediction.
Leveraging a novel architecture that combines Vision-Language Models with diffusion for optical flow generation.
First approach to leverage natural language for guiding future optical flow prediction in diverse scenarios.
Trained on large-scale human activity data with effective preprocessing for noisy real-world video-caption pairs.
Our unified VLM-Diffusion architecture combines strong multimodal reasoning capabilities with pixel-level generative fidelity for language-conditioned future optical flow prediction.
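To make this idea concrete, here is a minimal, self-contained PyTorch sketch of the general recipe: a VLM-style encoder turns the first frame and the language prompt into conditioning tokens, and a diffusion denoiser predicts the noise added to a two-channel future flow field while cross-attending to those tokens. All module names, dimensions, and the simple interpolation noise schedule below are illustrative assumptions, not the FOFPred implementation; please refer to the paper for the actual architecture.

# A minimal sketch of a language-conditioned flow-diffusion training step.
# All module names, sizes, and the linear-interpolation noise schedule are
# illustrative assumptions and NOT the FOFPred implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLMEncoder(nn.Module):
    """Stand-in for a VLM: maps the first frame and a tokenized prompt to conditioning tokens."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.img_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify the frame
        self.txt_emb = nn.Embedding(vocab, dim)

    def forward(self, image, prompt_ids):
        img_tokens = self.img_proj(image).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt_tokens = self.txt_emb(prompt_ids)                         # (B, N_txt, dim)
        return torch.cat([img_tokens, txt_tokens], dim=1)             # joint multimodal tokens

class ToyFlowDenoiser(nn.Module):
    """Predicts the noise added to a 2-channel future flow field, cross-attending to VLM tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.in_conv = nn.Conv2d(2, dim, 3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out_conv = nn.Conv2d(dim, 2, 3, padding=1)

    def forward(self, noisy_flow, t, cond_tokens):
        h = self.in_conv(noisy_flow)                        # (B, dim, H, W)
        B, C, H, W = h.shape
        q = h.flatten(2).transpose(1, 2) + t.view(B, 1, 1)  # flow pixels as queries; crude timestep cond.
        attn_out, _ = self.attn(q, cond_tokens, cond_tokens)
        h = (q + attn_out).transpose(1, 2).reshape(B, C, H, W)
        return self.out_conv(h)                             # predicted noise

# One illustrative noise-prediction training step.
encoder, denoiser = ToyVLMEncoder(), ToyFlowDenoiser()
image = torch.randn(1, 3, 64, 64)            # first frame
prompt_ids = torch.randint(0, 1000, (1, 8))  # tokenized caption
future_flow = torch.randn(1, 2, 64, 64)      # ground-truth future flow (dx, dy)

t = torch.rand(1)                            # diffusion time in [0, 1]
noise = torch.randn_like(future_flow)
noisy_flow = (1 - t).view(1, 1, 1, 1) * future_flow + t.view(1, 1, 1, 1) * noise

pred_noise = denoiser(noisy_flow, t, encoder(image, prompt_ids))
loss = F.mse_loss(pred_noise, noise)
loss.backward()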
We first visualize optical flow predictions of our base model. Then we showcase our model's performance on the two downstream tasks: robotic manipulation and video generation.
We visualize optical flow predictions of our base model for a randomly selected set of image-prompt pairs, highlighting both the strengths and limitations of our method.
"Moving the water bottle from right to left."
"Lifting the water bottle up."
FOFPred enables controllable video generation through predicted optical flow. Both the baseline and our model generate videos from the same first-frame image and prompt. For our model, we visualize both the predicted optical flow and the final generated video.
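To build intuition for why dense future flow is a useful control signal here, the sketch below backward-warps the first frame with a flow field using torch.nn.functional.grid_sample. This is only an illustration of flow as a motion prior; how the predicted flow actually conditions the video generator in FOFPred is described in the paper and may differ.

# Illustrative only: backward-warp a frame with a dense flow field via grid_sample.
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a frame (B, 3, H, W) with a flow field (B, 2, H, W) given in pixels."""
    B, _, H, W = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow                                      # shifted sampling locations
    # Normalize to [-1, 1] as expected by grid_sample (x by W, y by H).
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

frame = torch.rand(1, 3, 64, 64)                     # first frame
flow = torch.zeros(1, 2, 64, 64); flow[:, 0] = 3.0   # sample 3 px to the right; content shifts left
next_frame = warp_with_flow(frame, flow)
print(next_frame.shape)                              # torch.Size([1, 3, 64, 64])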
We show examples from the RoboTwin environment, where a fine-tuned version of our FOFPred model performs language-driven robotic manipulation tasks. For further details on implementation and experiments, please refer to our code and paper.
@article{Ran26FOFPred,
title = {Language Driven Future Optical Flow Prediction},
author = {Ranasinghe, Kanchana and Zhou, Honglu and Fang, Yu and
Yang, Luyu and Xue, Le and Xu, Ran and Xiong, Caiming and
Savarese, Silvio and Ryoo, Michael S. and Niebles, Juan Carlos},
journal = {arXiv},
year = {2026}
}