May 6, 2026 · arXiv:2605.04461

Stream-T1: Test-Time Scaling for Streaming Video Generation

Yijing Tu, Wenchuan Wang, Chunxiao Liu, Zhendong Mao

AI Summary

This paper introduces Stream-T1, a novel test-time scaling (TTS) framework designed specifically for streaming video generation using diffusion models. Stream-T1 leverages chunk-level synthesis and a limited number of denoising steps to reduce computational costs and enable fine-grained temporal control through three key components: scaled noise propagation, reward pruning, and memory sinking. Experiments on 5s and 30s video benchmarks demonstrate that Stream-T1 significantly improves temporal consistency, motion smoothness, and frame-level visual quality compared to existing methods.

Key Contribution

Achieve superior video generation quality and temporal coherence without expensive retraining by intelligently scaling and steering diffusion models at test time.

Abstract

While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited for TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a pioneering comprehensive TTS framework exclusively tailored for streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream-Scaled Noise Propagation, which actively refines the initial latent noise of the chunk being generated using historically proven, high-quality noise from previous chunks, effectively establishing temporal dependency and utilizing the historical Gaussian prior to guide the current generation; (2) Stream-Scaled Reward Pruning, which comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from the KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 demonstrates profound superiority, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.
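The first two units described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the variance-preserving blend in `propagate_noise`, the linear weighting in `combined_reward`, and every name and parameter here (`alpha`, `weight`, `select_chunk`, the reward callables) are assumptions chosen only to make the idea concrete. Memory sinking is not covered.

```python
import math
from collections import deque


def propagate_noise(prev_noise, fresh_noise, alpha=0.5):
    """Blend the previous chunk's selected noise with fresh Gaussian noise.

    Assumption: a variance-preserving mix, sqrt(alpha) * prev + sqrt(1 - alpha) * fresh,
    so the result stays unit-variance when both inputs are independent unit Gaussians.
    """
    a, b = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [a * p + b * f for p, f in zip(prev_noise, fresh_noise)]


def combined_reward(short_score, long_score, weight=0.5):
    """Assumed linear mix of a per-chunk score and a sliding-window score."""
    return (1.0 - weight) * short_score + weight * long_score


def select_chunk(candidates, short_reward, long_reward, history, weight=0.5):
    """Keep the candidate with the highest combined reward; the rest are pruned.

    short_reward(chunk) scores the candidate alone.
    long_reward(window) scores the recent history with the candidate appended.
    """
    def score(chunk):
        window = list(history) + [chunk]
        return combined_reward(short_reward(chunk), long_reward(window), weight)

    return max(candidates, key=score)


# Toy usage: chunks are one-element lists, and history holds the last 3 winners.
history = deque([[1.0]], maxlen=3)
candidates = [[0.0], [1.0], [2.0]]
best = select_chunk(
    candidates,
    short_reward=lambda c: c[0],                           # stand-in "quality" score
    long_reward=lambda w: -abs(w[-1][0] - w[0][0]),        # stand-in "consistency" score
    history=history,
    weight=0.75,
)
history.append(best)
```

In a real pipeline the candidates would be decoded video chunks and the reward callables would be learned or metric-based scorers; the sliding window is modeled here by a bounded `deque`.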

Computer Vision, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
