Apr 2, 2026 · arXiv:2604.01693

From Understanding to Erasing: Towards Complete and Stable Video Object Removal

Dingming Liu, Wenjing Wang, Chen Li, Jing Lyu

AI Summary

This paper tackles video object removal by giving diffusion models a deeper understanding of object-scene interactions. The authors distill knowledge from vision foundation models to capture the relationships between objects and the side effects they induce (shadows, reflections), and they introduce a framewise context cross-attention mechanism that draws on the unmasked context around the target region. The resulting model is reported to achieve state-of-the-art performance in removing objects and their induced effects while maintaining spatio-temporal consistency. The authors also introduce a new real-world benchmark for video object removal.

Key Contribution

Removing objects from video now means removing their shadows and reflections too, thanks to a new method that teaches diffusion models to "understand" object-scene physics.

Abstract

Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
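The abstract names a framewise context cross-attention mechanism but does not specify its design. The NumPy sketch below is only an illustration of the general idea, not the authors' implementation: within each frame, every token attends only to tokens outside the removal mask. The function name, the tensor shapes, and the omission of learned query/key/value projections are all assumptions made for brevity.

```python
import numpy as np


def framewise_context_cross_attention(latents, keep_mask):
    """Hypothetical sketch: per-frame attention restricted to unmasked context.

    latents:   (T, N, D) array, N token features of width D for each of T frames.
    keep_mask: (T, N) boolean array, True where a token lies OUTSIDE the
               removal mask (i.e. it is usable context).
    Returns a (T, N, D) array in which each token is a weighted average of
    the unmasked tokens from its own frame.
    """
    T, N, D = latents.shape
    if not keep_mask.any(axis=1).all():
        raise ValueError("every frame needs at least one unmasked token")

    # Scaled dot-product scores, computed independently per frame: (T, N, N).
    # Assumption: identity projections stand in for learned Q/K/V weights.
    scores = latents @ latents.transpose(0, 2, 1) / np.sqrt(D)

    # Forbid attention to keys inside the removal mask.
    scores = np.where(keep_mask[:, None, :], scores, -np.inf)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Aggregate context features for every query token.
    return weights @ latents


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 4, 3))            # 2 frames, 4 tokens, 3 channels
    mask = np.array([[True, False, True, True],
                     [True, True, False, True]])
    print(framewise_context_cross_attention(x, mask).shape)
```

Because attention is computed frame by frame and masked keys receive zero weight, the content of a masked token cannot leak into the features of other tokens. In a real denoising block, the output would typically be combined with the block's hidden states through learned projections and a residual connection, which this sketch leaves out.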

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
