Latent Noise Optimization for Text-to-Video Generation

Yujiang Pu         Yu Kong        
Michigan State University

Abstract

Diffusion models have significantly advanced text-to-video (T2V) generation, yet they still struggle with complex prompts involving intricate object interactions and precise attribute binding. While reward-based fine-tuning can improve compositional alignment, it is computationally costly and prone to reward hacking. In this work, we propose NoisEasier, a test-time training framework that improves T2V generation by directly optimizing the latent noise with differentiable rewards during inference. To make this practical, we leverage fast video consistency models, enabling full gradient backpropagation through just 4 denoising steps. To mitigate reward hacking, we integrate multiple reward objectives that balance semantic alignment and motion quality, and propose a novel negative-aware reward calibration strategy that uses LLM-generated distractors to provide fine-grained compositional feedback. Experiments on VBench and T2V-CompBench show that NoisEasier consistently improves strong baselines, achieving gains of over 10% in several dimensions and even surpassing commercial models such as Gen-3 and Kling. Notably, these improvements are obtained within 25 optimization steps, requiring only 45 seconds per sample on two RTX 6000 Ada GPUs. Under the same wall-clock time budget, NoisEasier achieves human preference win rates that exceed those of CogVideoX-2B by 18.8% and of Wan2.1-1.3B by 6.8%, demonstrating a competitive trade-off between performance and efficiency.
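To make the test-time procedure concrete, the sketch below illustrates the general idea of latent noise optimization described in the abstract: the initial noise is treated as a learnable parameter and updated by backpropagating a differentiable reward through a short consistency-model rollout. This is a minimal PyTorch-style illustration, not the authors' released implementation; `consistency_denoiser`, `decode_latents`, and `reward_fn` are hypothetical stand-ins for the actual T2V pipeline components.

```python
import torch

def optimize_latent_noise(noise, text_emb, consistency_denoiser, decode_latents,
                          reward_fn, num_opt_steps=25, num_denoise_steps=4, lr=1e-2):
    """Sketch: optimize initial latent noise against a differentiable reward."""
    # Treat the initial noise as a trainable leaf tensor.
    latent = noise.detach().clone().requires_grad_(True)
    optimizer = torch.optim.AdamW([latent], lr=lr)

    for _ in range(num_opt_steps):
        z = latent
        # Full gradient backpropagation through a short (e.g., 4-step)
        # consistency-model denoising rollout.
        for step in range(num_denoise_steps):
            z = consistency_denoiser(z, text_emb, step)

        video = decode_latents(z)              # differentiable decoding of latents
        reward = reward_fn(video, text_emb)    # e.g., a weighted combination of
                                               # semantic-alignment and motion rewards
        loss = -reward.mean()                  # ascend the reward

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return latent.detach()
```

In this framing, the diffusion backbone stays frozen and only the noise is updated, which is what keeps the per-sample cost to a handful of optimization steps rather than full model fine-tuning.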

Figure: Qualitative results of NoisEasier applied to T2V-Turbo (VC2) and T2V-Turbo (MS), with examples from T2V-CompBench and VBench.