Direct Noise Optimization for Text-to-Video Generation

Yujiang Pu         Yu Kong        
Michigan State University

Abstract

Diffusion models have significantly advanced text-to-video (T2V) generation, yet they still struggle with fine-grained compositional alignment, including accurate attribute binding, spatial relationships, and object interactions. While reward-based fine-tuning offers a potential remedy, it is susceptible to reward hacking and may fail to generalize to unseen prompt distributions. In this work, we propose NoisEasier, a test-time training framework that improves T2V generation by directly refining latent noise with differentiable rewards. Built upon fast video consistency models, our method enables efficient gradient-based noise optimization within 4 denoising steps. To mitigate reward hacking, we integrate multiple reward objectives that balance semantic alignment and temporal coherence, and propose a negative-aware reward calibration strategy that strengthens compositional feedback beyond pairwise preference models. Experiments on VBench and T2V-CompBench demonstrate that NoisEasier substantially improves compositional alignment over strong open-source baselines and even surpasses commercial models like Pika and Kling. These results show that noise optimization provides a practical, flexible, and effective alternative to full model fine-tuning for aligned T2V generation.
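The abstract's core idea, refining the initial latent noise by gradient ascent on a differentiable reward computed through a few-step consistency-model sampler, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise`, `reward`, and all parameter names here are hypothetical stand-ins for the actual consistency sampler and reward models.

```python
import torch

def optimize_noise(denoise, reward, prompt_emb, shape, steps=20, lr=0.01):
    """Sketch of direct noise optimization (hypothetical API).

    denoise:    differentiable few-step consistency sampler, noise -> video latents
    reward:     differentiable scalar reward (e.g., an ensemble balancing
                semantic alignment and temporal coherence)
    prompt_emb: text-prompt embedding conditioning both functions
    """
    noise = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        video = denoise(noise, prompt_emb)  # e.g., 4 denoising steps
        loss = -reward(video, prompt_emb)   # ascend the reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```

Because only the noise is updated, the generator's weights stay frozen, which is what makes this a test-time alternative to reward fine-tuning; combining several reward terms inside `reward` is one way to reduce hacking of any single objective.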

[Video gallery: T2V-Turbo (VC2) + NoisEasier and T2V-Turbo (MS) + NoisEasier, with example generations from T2V-CompBench and VBench.]