NoisEasier: Test-Time Noise Optimization for Text-to-Video Generation

Yujiang Pu         Yu Kong        
Michigan State University

Abstract

Diffusion models have recently advanced text-to-video (T2V) generation, yet they still struggle with fine-grained compositional alignment, such as attribute binding, spatial relations, and object interactions. While reward-based model fine-tuning offers a potential remedy, it is susceptible to reward hacking and may generalize poorly to unseen prompt distributions. In this work, we propose NoisEasier, a practical test-time scaling framework that improves T2V generation by directly refining the latent noise with differentiable reward feedback. To make full gradient backpropagation tractable in video diffusion, we leverage video consistency models that compress sampling to 4–8 denoising steps, so end-to-end noise refinement fits within realistic inference budgets. To mitigate reward hacking, we adopt a multi-reward formulation that balances semantic alignment and temporal coherence, and we introduce negative-aware reward calibration to strengthen compositional feedback beyond pairwise preference models. Experiments on VBench and T2V-CompBench demonstrate that NoisEasier substantially improves compositional alignment over baseline models at modest additional inference cost. Overall, NoisEasier offers a flexible alternative to reward-based fine-tuning and a complementary add-on when fine-tuned models are available.
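To make the core idea concrete, the sketch below illustrates test-time noise optimization as described in the abstract: gradient ascent on a multi-reward objective, backpropagated through a differentiable few-step sampler to the initial latent noise. This is a minimal sketch under stated assumptions, not the paper's implementation; `few_step_sampler`, `semantic_reward`, and `temporal_reward` are hypothetical stand-ins for a 4–8 step video consistency-model sampler and the reward models.

```python
# Minimal sketch of test-time noise optimization (hypothetical interfaces;
# `few_step_sampler`, `semantic_reward`, and `temporal_reward` are stand-ins,
# not the paper's actual API).
import torch

def refine_noise(few_step_sampler, semantic_reward, temporal_reward,
                 prompt, noise_shape, steps=20, lr=0.01, w_temporal=0.5):
    """Refine the initial latent noise by ascending a multi-reward objective.

    `few_step_sampler(noise, prompt)` is assumed to run a 4-8 step video
    consistency-model sampler and return a differentiable video tensor,
    so gradients flow from the rewards back to the noise.
    """
    noise = torch.randn(noise_shape, requires_grad=True)
    optimizer = torch.optim.Adam([noise], lr=lr)

    for _ in range(steps):
        video = few_step_sampler(noise, prompt)  # differentiable decode
        # Multi-reward objective: semantic alignment plus temporal coherence,
        # combined so that no single reward dominates (mitigating hacking).
        reward = semantic_reward(video, prompt) + w_temporal * temporal_reward(video)
        loss = -reward  # minimize negative reward = gradient ascent on reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return noise.detach()
```

Backpropagating through only 4–8 sampler steps is what keeps this loop's memory and compute realistic; unrolling a standard many-step diffusion sampler in the same way would be prohibitively expensive.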

[Teaser figure: qualitative examples from T2V-CompBench and VBench, generated by T2V-Turbo (VC2) + NoisEasier and T2V-Turbo (MS) + NoisEasier.]