
Existing works remain limited to creating "static" visual instructions, i.e., single images that depict only action execution or final object states. We take the first step towards shifting instructional image generation to video generation. While general image-to-video (I2V) models can animate images based on text prompts, they primarily target artistic creation, overlooking the evolution of object states and action transitions in instructional scenarios. To this end, we introduce ShowMe, a novel framework that enables plausible action-object state manipulation and coherent state prediction. Our key finding is that video diffusion models can inherently serve as action-object state transformers, showing great potential for performing state manipulation while preserving contextual consistency. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of the proposed method.