Huawei, Nankai University, NKIARI | Apr 30, 2026 | arXiv:2604.27322

YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Chenyang Wu, Lina Lei, Fan Li, Chun-Le Guo, Dehong Kong, Xinran Qin, Zhixin Wang, Ming-Ming Cheng, Chongyi Li

AI Summary

This paper introduces YOSE, a fine-tuning framework for Diffusion Transformer (DiT)-based video object removal that reduces inference latency. YOSE uses Batch Variable-length Indexing (BVI) to adaptively select essential tokens based on mask information, and a Diffusion Process Simulator (DiffSim) to approximate the influence of unmasked tokens so that masked tokens stay semantically consistent. Experiments show YOSE achieves up to 2.5x speedup in 70% of cases while preserving visual quality comparable to the baseline.

Key Contribution

Achieves up to 2.5x faster video object removal with comparable visual quality by processing only the essential, mask-selected tokens in Diffusion Transformers.

Abstract

Recent advances in Diffusion Transformer (DiT)-based video generation have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10 FPS, primarily due to dense computation over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE (You Only Select Essential Tokens), an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and the Diffusion Process Simulator (DiffSim) module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where inference time scales approximately linearly with the size of the masked region, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5x speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: https://github.com/Wucy0519/YOSE-CVPR26.
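The idea of mask-driven, variable-length token selection can be illustrated with a minimal sketch. This is not the authors' BVI operator: it is not differentiable, it works on plain Python lists rather than tensors, and the function names (`pack_masked_tokens`, `scatter_back`) and the cumulative-length layout are assumptions made purely for illustration. It only shows how each sample in a batch can contribute a different number of tokens to a single packed sequence, and how processed tokens can be written back to their original positions.

```python
from itertools import accumulate


def pack_masked_tokens(tokens, masks):
    """Gather the tokens flagged by each sample's mask into one flat list.

    tokens: list of samples, each a list of token values
    masks:  list of samples, each a list of 0/1 flags aligned with tokens
    Returns (packed, indices, cu_seqlens), where cu_seqlens[b] is the offset
    of sample b inside the packed list (hypothetical layout for this sketch).
    """
    packed, indices, lengths = [], [], []
    for sample_tokens, sample_mask in zip(tokens, masks):
        keep = [i for i, flag in enumerate(sample_mask) if flag]
        indices.append(keep)
        lengths.append(len(keep))
        packed.extend(sample_tokens[i] for i in keep)
    cu_seqlens = [0] + list(accumulate(lengths))
    return packed, indices, cu_seqlens


def scatter_back(tokens, packed, indices, cu_seqlens):
    """Write processed packed tokens back; unselected tokens stay unchanged."""
    out = [list(sample) for sample in tokens]
    for b, keep in enumerate(indices):
        start = cu_seqlens[b]
        for offset, i in enumerate(keep):
            out[b][i] = packed[start + offset]
    return out


# Two samples with different numbers of masked tokens (2 and 1).
tokens = [[10, 11, 12, 13], [20, 21, 22, 23]]
masks = [[0, 1, 1, 0], [1, 0, 0, 0]]

packed, indices, cu_seqlens = pack_masked_tokens(tokens, masks)
processed = [-t for t in packed]  # stand-in for the expensive per-token work
restored = scatter_back(tokens, processed, indices, cu_seqlens)
```

In this toy batch only 3 of the 8 tokens pass through the stand-in computation, so the per-token work shrinks with the mask, which mirrors the approximately linear scaling with masked-region size that the abstract describes.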

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Inference & Quantization
