PKU | Jun 4, 2026 | arXiv:2606.05730

TextWand: A Unified Framework for Scene Text Editing

Shuyu Wang, Zhile Guan, Hongxiu Chen, Yule Duan, Weiqi Li, Xin Shan, Ronggang Wang

AI Summary

TextWand is a unified framework that integrates scene text removal, generation, and replacement into a single model, enhancing control over text appearance and background integrity. The framework employs Overlay-Reference Positional Encoding (ORPE) for pixel-level layout fidelity and Region-Adaptive Suppression (RAS) for effective text erasure. Extensive evaluations reveal that TextWand surpasses existing models in text content accuracy, layout consistency, and overall image quality across various editing tasks.

Key Contribution

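TextWand handles scene text removal, generation, and replacement in a single model, and is reported to outperform leading open-source and closed-source models in text content accuracy, layout and style consistency, and overall image quality.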

Abstract

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
