Show Me: Generating Instructional Videos with Diffusion Models

Existing work remains limited to creating "static" visual instructions, i.e., single images that depict only action execution or final object states. We take the first step towards shifting instructional image generation to video generation. While general image-to-video (I2V) models can animate images based on text prompts, they primarily target artistic creation and overlook the evolution of object states and action transitions in instructional scenarios. To this end, we introduce ShowMe, a novel framework that enables plausible action-object state manipulation and coherent state prediction. Our key finding is that video diffusion models can inherently serve as action-object state transformers, showing great potential for performing state manipulation while preserving contextual consistency. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of the proposed method.

Reference
Pu, Y., Huang, Z., Boddeti, V., & Kong, Y. (2026). Show Me: Generating Instructional Videos with Diffusion Models. Winter Conference on Applications of Computer Vision (WACV). Project Webpage



