
Existing works remain limited to creating "static" visual instructions, i.e., single images that solely depict action execution or final object states. We take the first step toward shifting instructional image generation to instructional video generation. While general image-to-video (I2V) models can animate images from text prompts, they focus primarily on artistic creation and overlook the evolution of object states and action transitions in instructional scenarios. To this end, we introduce ShowMe, a novel framework that enables plausible action-object state manipulation and coherent state prediction. Our key finding is that video diffusion models can inherently serve as action-object state transformers, showing great potential for performing state manipulation while maintaining contextual consistency. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of the proposed method.

Action-Object State Manipulation

Our model can serve as an efficient image editor that manipulates the action-object states in the reference image while preserving spatial context and consistency.

Action-Object State Prediction

Our model can also generate the action execution process from an instruction and an initial image, depicting the action's completion with plausible visual effects.

Something-Something v2 Dataset (SSv2)

"unfolding towel"

First frame of 34355
Animated 34355

"folding towel"

First frame of 84116
Animated 84116

"tilting book with package on it until it falls off"

First frame of 39332
Animated 39332

"tearing paper"

First frame of 52838
Animated 52838

"stuffing duvet into washing machine"

First frame of 118252
Animated 118252

"scooping baking soda up with spoon"

First frame of 178000
Animated 178000

"poking tube so that it falls over"

First frame of 161547
Animated 161547

"moving remote and small remote away from each other"

First frame of 51945
Animated 51945

Epic-Kitchens 100 Dataset (Epic100)

"wash plate"

First frame of P01_11_96
Animated P01_11_96

"rinse spoon"

First frame of P01_13_16
Animated P01_13_16

"open fridge"

First frame of P01_12_45
Animated P01_12_45

"put can in fridge"

First frame of P13_03_39
Animated P13_03_39

"open drawer"

First frame of P01_14_117
Animated P01_14_117

"stir food"

First frame of P01_14_181
Animated P01_14_181

"turn off tap"

First frame of P22_02_177
Animated P22_02_177

"cut potatoes"

First frame of P24_09_52
Animated P24_09_52

In-Context Video Generation

Given the same visual context, our model can achieve the intended goal by following distinct yet feasible action instructions.

"put plate"

First frame of P30_07_37_0
Animated put plate

"take plate"

First frame of P01_13_16
Animated take plate

"take glass"

First frame of P30_07_37
Animated take glass

"close cupboard"

First frame of P30_07_37
Animated close cupboard