
Existing works remain limited to creating "static" visual instructions, i.e., single images that solely depict action execution or final object states. We take the first step toward shifting instructional image generation to instructional video generation. While general image-to-video (I2V) models can animate images from text prompts, they focus primarily on artistic creation and overlook the evolution of object states and action transitions in instructional scenarios. To this end, we introduce ShowMe, a novel framework that enables plausible action-object state manipulation and coherent state prediction. Our key finding is that video diffusion models can inherently serve as action-object state transformers, showing great potential for performing state manipulation while maintaining contextual consistency. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of the proposed method.

Action-Object State Manipulation

Our model can serve as an efficient image editor that manipulates the action-object states in the reference image while preserving spatial context and consistency.

Action-Object State Prediction

Our model can also generate the action execution process from an instruction and an initial image, depicting the action's completion with plausible visual effects.

Something-Something v2 Dataset (SSv2)

"unfolding towel"

First frame of 34355
Animated 34355

"folding towel"

First frame of 84116
Animated 84116

"tilting book with package on it until it falls off"

First frame of 39332
Animated 39332

"tearing paper"

First frame of 52838
Animated 52838

"stuffing duvet into washing machine"

First frame of 118252
Animated 118252

"scooping baking soda up with spoon"

First frame of 178000
Animated 178000

"poking tube so that it falls over"

First frame of 161547
Animated 161547

"moving remote and small remote away from each other"

First frame of 51945
Animated 51945

Epic-Kitchens 100 Dataset (Epic100)

"wash plate"

First frame of P01_11_96
Animated P01_11_96

"rinse spoon"

First frame of P01_13_16
Animated P01_13_16

"open fridge"

First frame of P01_12_45
Animated P01_12_45

"put can in fridge"

First frame of P13_03_39
Animated P13_03_39

"open drawer"

First frame of P01_14_117
Animated P01_14_117

"stir food"

First frame of P01_14_181
Animated P01_14_181

"turn off tap"

First frame of P22_02_177
Animated P22_02_177

"cut potatoes"

First frame of P24_09_52
Animated P24_09_52

In-Context Video Generation

Given the same visual context, our model can achieve the intended goal by following distinct yet feasible action instructions.

"put plate"

First frame of P30_07_37_0
Animated put plate

"take plate"

First frame of P01_13_16
Animated take plate

"take glass"

First frame of P30_07_37
Animated take glass

"close cupboard"

First frame of P30_07_37
Animated close cupboard