Abstract

Generating visual instructions in a given context is essential for developing interactive world simulators. Prior works address this problem through either text-guided image manipulation or video prediction, but the two tasks are typically treated in isolation. This separation exposes a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To address this, we propose ShowMe, a unified framework that handles both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.
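The abstract describes ShowMe as selectively activating the spatial and temporal components of a video diffusion model depending on the task. The snippet below is a minimal, self-contained sketch of that idea, not the authors' implementation: the block layout, the `image_edit_mode` flag, and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of selectively activating the
# spatial and temporal components of a video diffusion block.
import torch
import torch.nn as nn


class UnifiedSpatioTemporalBlock(nn.Module):
    """One transformer block with separate spatial and temporal attention.

    For instruction-guided image editing (a single output frame) the
    temporal path can be skipped; for video prediction both paths run.
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, image_edit_mode: bool = False) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) latent features per frame.
        b, f, n, d = x.shape

        # Spatial attention: tokens attend within each frame.
        xs = x.reshape(b * f, n, d)
        xs_norm = self.norm_s(xs)
        attn_s, _ = self.spatial_attn(xs_norm, xs_norm, xs_norm)
        x = (xs + attn_s).reshape(b, f, n, d)

        if image_edit_mode:
            # Image manipulation: only the spatial pathway is active,
            # so pretrained spatial knowledge drives the single-frame edit.
            return x

        # Video prediction: tokens also attend across frames at each position.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        xt_norm = self.norm_t(xt)
        attn_t, _ = self.temporal_attn(xt_norm, xt_norm, xt_norm)
        xt = xt + attn_t
        return xt.reshape(b, n, f, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    block = UnifiedSpatioTemporalBlock()
    latents = torch.randn(1, 8, 16, 64)                      # 8-frame latent video
    edited = block(latents[:, :1], image_edit_mode=True)     # single-frame edit path
    predicted = block(latents, image_edit_mode=False)        # full video path
    print(edited.shape, predicted.shape)
```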

Instructional Image Generation

Our model can serve as an efficient image editor that manipulates the action-object state in a reference image while preserving spatial context and consistency.

Instructional Video Generation

Our model can also generate the action execution process from an instruction and an initial image, depicting the completion of the action with plausible visual effects.
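The abstract also mentions structure and motion consistency rewards used to improve structural fidelity and temporal coherence. Their exact formulations are not reproduced on this page; the sketch below shows one plausible way such rewards could be instantiated, using image-gradient agreement as a structure proxy and frame-difference similarity as a motion proxy. Both functions and their signatures are assumptions for illustration.

```python
# Speculative sketch of structure and motion consistency rewards; the
# formulations below are illustrative assumptions, not ShowMe's definitions.
import torch
import torch.nn.functional as F


def structure_consistency_reward(pred: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Reward agreement between image gradients (a cheap structure proxy).

    pred, ref: (batch, channels, height, width) frames in [0, 1].
    """
    def grads(img):
        dx = img[..., :, 1:] - img[..., :, :-1]
        dy = img[..., 1:, :] - img[..., :-1, :]
        return dx, dy

    pdx, pdy = grads(pred)
    rdx, rdy = grads(ref)
    # Negative L1 distance: higher reward means more similar edge structure.
    return -(F.l1_loss(pdx, rdx) + F.l1_loss(pdy, rdy))


def motion_consistency_reward(pred_video: torch.Tensor, ref_video: torch.Tensor) -> torch.Tensor:
    """Reward similar frame-to-frame changes (a crude stand-in for optical flow).

    pred_video, ref_video: (batch, frames, channels, height, width).
    """
    pred_motion = pred_video[:, 1:] - pred_video[:, :-1]
    ref_motion = ref_video[:, 1:] - ref_video[:, :-1]
    return F.cosine_similarity(
        pred_motion.flatten(1), ref_motion.flatten(1), dim=1
    ).mean()


if __name__ == "__main__":
    frame_pred, frame_ref = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    video_pred, video_ref = torch.rand(1, 8, 3, 64, 64), torch.rand(1, 8, 3, 64, 64)
    print(structure_consistency_reward(frame_pred, frame_ref).item())
    print(motion_consistency_reward(video_pred, video_ref).item())
```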

Something-Something v2 Dataset (SSv2)

"unfolding towel"

First frame of 34355
Animated 34355

"folding towel"

First frame of 84116
Animated 84116

"tilting book with package on it until it falls off"

First frame of 39332
Animated 39332

"tearing paper"

First frame of 52838
Animated 52838

"stuffing duvet into washing machine"

First frame of 118252
Animated 118252

"scooping baking soda up with spoon"

First frame of 178000
Animated 178000

"poking tube so that it falls over"

First frame of 161547
Animated 161547

"moving remote and small remote away from each other"

First frame of 51945
Animated 51945

EPIC-KITCHENS-100 Dataset (Epic100)

"wash plate"

First frame of P01_11_96
Animated P01_11_96

"rinse spoon"

First frame of P01_13_16
Animated P01_13_16

"open fridge"

First frame of P01_12_45
Animated P01_12_45

"put can in fridge"

First frame of P13_03_39
Animated P13_03_39

"open drawer"

First frame of P01_14_117
Animated P01_14_117

"stir food"

First frame of P01_14_181
Animated P01_14_181

"turn off tap"

First frame of P22_02_177
Animated P22_02_177

"cut potatoes"

First frame of P24_09_52
Animated P24_09_52

In-Context Video Generation

Given the same visual context, our model can carry out distinct yet feasible action instructions, each achieving its intended goal.

"put plate"

First frame of P30_07_37_0
Animated put plate

"take plate"

First frame of P01_13_16
Animated take plate

"take glass"

First frame of P30_07_37
Animated take glass

"close cupboard"

First frame of P30_07_37
Animated close cupboard