FOFPred Overview

FOFPred is a diffusion-based model that predicts future optical flow from a single image guided by natural language instructions.

Abstract

Future motion representations, such as optical flow, offer substantial value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data is relatively unexplored.

We introduce FOFPred, a novel language-conditioned optical flow forecasting model built on a unified Vision-Language Model (VLM) and Diffusion architecture. This combination couples strong multimodal reasoning with pixel-level generative fidelity for future motion prediction.

Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signal from this noisy video-caption data, we combine targeted data preprocessing with our unified architecture and its strong image pretraining.

The trained model is then extended to two downstream tasks in control and generation. Evaluations on robotic manipulation and video generation in language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and of scalable learning from diverse web data for future optical flow prediction.

Key Contributions

Unified VLM-Diffusion

A novel architecture that combines Vision-Language Models with diffusion for optical flow generation.

Language-Conditioned

The first approach to use natural language to guide future optical flow prediction across diverse scenarios.

Web-Scale Training

Trained on large-scale human activity data with effective preprocessing for noisy real-world video-caption pairs.




Method Overview

Our unified VLM-Diffusion architecture combines strong multimodal reasoning capabilities with pixel-level generative fidelity for language-conditioned future optical flow prediction.
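
As a rough illustration of the idea (not the authors' released code), the sketch below shows a single denoising step of a flow diffusion model conditioned on a pooled VLM embedding of the image and instruction. The module design, feature dimensions, and noise-schedule value are illustrative assumptions.

import torch
import torch.nn as nn

class FlowDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a 2-channel flow map,
    conditioned on a pooled VLM embedding of the image and instruction."""
    def __init__(self, cond_dim=512, hidden=64):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(2 + hidden, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 2, 3, padding=1),
        )

    def forward(self, noisy_flow, cond):
        # Broadcast the conditioning vector over the spatial grid.
        b, _, h, w = noisy_flow.shape
        c = self.cond_proj(cond).view(b, -1, 1, 1).expand(b, -1, h, w)
        return self.net(torch.cat([noisy_flow, c], dim=1))

# One DDPM-style estimate of the clean flow from a noisy sample.
denoiser = FlowDenoiser()
vlm_features = torch.randn(1, 512)        # stand-in for VLM(image, instruction)
flow_t = torch.randn(1, 2, 64, 64)        # noisy flow at some diffusion timestep
alpha_bar = torch.tensor(0.5)             # cumulative noise-schedule term (assumed)
eps_hat = denoiser(flow_t, vlm_features)  # predicted noise
flow_0_hat = (flow_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()

The sketch only illustrates the interface of language-and-image-conditioned flow denoising; see the paper for the actual architecture and conditioning scheme.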




Results

We first visualize optical flow predictions of our base model. Then we showcase our model's performance on the two downstream tasks: robotic manipulation and video generation.

Optical Flow Prediction

We visualize optical flow predictions of our base model for a randomly selected set of image-prompt pairs, highlighting both the strengths and limitations of our method.



Motion Focused Video Generation

FOFPred enables controllable video generation through predicted optical flow. The baseline and our model generate videos from the same first-frame image and prompt. For our model, we visualize both the predicted optical flow and the final generated video.

Each comparison shows three panels: Baseline (CogVideoX), Predicted OF (Ours), and Generated Video (Ours), for the following prompts:

PROMPT: Moving glue stick away from camera
PROMPT: Pulling toy car from left to right
PROMPT: Push fidget spinner from left to right
PROMPT: Moving bicycle towards camera
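
To make the connection between predicted flow and future frames concrete, the following sketch (illustrative only, not the released pipeline) backward-warps the first frame with a predicted flow field using torch.nn.functional.grid_sample. The tensor shapes and the placeholder flow are assumptions; in practice the predicted flow conditions a video generator rather than producing frames by warping alone.

import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """frame: (B, 3, H, W); flow: (B, 2, H, W) in pixels, ordered (dx, dy).
    Samples frame at locations displaced by the flow (backward warping)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype),
        torch.arange(w, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow   # (B, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

first_frame = torch.rand(1, 3, 64, 64)
predicted_flow = torch.zeros(1, 2, 64, 64)   # placeholder for a FOFPred output
next_frame_estimate = warp_with_flow(first_frame, predicted_flow)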



Robotic Manipulation

We illustrate examples from the RoboTwin environment, where a fine-tuned version of our FOFPred model performs language-driven robotic manipulation tasks. For further details on implementation and experiments, please refer to our code and paper.

BibTeX

@article{Ran26FOFPred,
  title   = {Language Driven Future Optical Flow Prediction},
  author  = {Ranasinghe, Kanchana and Zhou, Honglu and Fang, Yu and 
             Yang, Luyu and Xue, Le and Xu, Ran and Xiong, Caiming and 
             Savarese, Silvio and Ryoo, Michael S. and Niebles, Juan Carlos},
  journal = {arXiv},
  year    = {2026}
}