NoisEasier: Test-Time Noise Optimization for Text-to-Video Generation

Yujiang Pu         Yu Kong        
Michigan State University

Abstract

Diffusion models have recently advanced text-to-video (T2V) generation, yet they still struggle with fine-grained compositional alignment, such as attribute binding, spatial relations, and object interactions. While reward-based model fine-tuning offers a potential remedy, it is susceptible to reward hacking and may generalize poorly to unseen prompt distributions. In this work, we propose NoisEasier, a practical test-time scaling framework that improves T2V generation by directly refining the latent noise with differentiable reward feedback. To make full gradient backpropagation tractable in video diffusion, we leverage video consistency models that compress sampling to 4–8 denoising steps, so end-to-end noise refinement fits within realistic inference budgets. To mitigate reward hacking, we adopt a multi-reward formulation that balances semantic alignment and temporal coherence, and we introduce negative-aware reward calibration to strengthen compositional feedback beyond pairwise preference models. Experiments on VBench and T2V-CompBench demonstrate that NoisEasier substantially improves compositional alignment over baseline models at modest additional inference cost. Overall, NoisEasier offers a flexible alternative to reward-based fine-tuning and a complementary add-on when fine-tuned models are available.
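To make the core idea concrete, the sketch below illustrates test-time noise optimization as described in the abstract: gradient ascent on a multi-reward objective, backpropagated through a differentiable few-step sampler to the initial latent noise. This is a minimal sketch under stated assumptions, not the paper's implementation; `few_step_sampler`, `semantic_reward`, and `temporal_reward` are hypothetical stand-ins for a 4–8 step video consistency-model sampler and the reward models.

```python
# Minimal sketch of test-time noise optimization (hypothetical interfaces;
# `few_step_sampler`, `semantic_reward`, and `temporal_reward` are stand-ins,
# not the paper's actual API).
import torch

def refine_noise(few_step_sampler, semantic_reward, temporal_reward,
                 prompt, noise_shape, steps=20, lr=0.01, w_temporal=0.5):
    """Refine the initial latent noise by ascending a multi-reward objective.

    `few_step_sampler(noise, prompt)` is assumed to run a 4-8 step video
    consistency-model sampler and return a differentiable video tensor,
    so gradients flow from the rewards back to the noise.
    """
    noise = torch.randn(noise_shape, requires_grad=True)
    optimizer = torch.optim.Adam([noise], lr=lr)

    for _ in range(steps):
        video = few_step_sampler(noise, prompt)  # differentiable decode
        # Multi-reward objective: semantic alignment plus temporal coherence,
        # combined so that no single reward dominates (mitigating hacking).
        reward = semantic_reward(video, prompt) + w_temporal * temporal_reward(video)
        loss = -reward  # minimize negative reward = gradient ascent on reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return noise.detach()
```

Backpropagating through only 4–8 sampler steps is what keeps this loop's memory and compute realistic; unrolling a standard many-step diffusion sampler in the same way would be prohibitively expensive.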

[Teaser figure: qualitative examples from T2V-CompBench and VBench, generated by T2V-Turbo (VC2) + NoisEasier and T2V-Turbo (MS) + NoisEasier.]