TextWand is a unified framework that integrates scene text removal, generation, and replacement into a single model, enhancing control over text appearance and background integrity. The framework employs Overlay-Reference Positional Encoding (ORPE) for pixel-level layout fidelity and Region-Adaptive Suppression (RAS) for effective text erasure. Extensive evaluations reveal that TextWand surpasses existing models in text content accuracy, layout consistency, and overall image quality across various editing tasks.
TextWand unifies scene text removal, generation, and replacement in a single model and outperforms leading open-source and closed-source models on text content accuracy, layout and style consistency, and overall image quality.
We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. Because existing datasets are single-task and no comprehensive benchmark covers general-purpose scene text editing, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms leading open-source and closed-source models in text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation, and replacement tasks.
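The decomposition into rendering and erasure primitives can be pictured with a small sketch. This is an illustration only, not TextWand's implementation: the names `Primitive`, `TASK_PLAN`, and `plan` are hypothetical, and the task-to-primitive mapping (removal as erasure, generation as rendering, replacement as erasure followed by rendering) is an assumption inferred from the abstract rather than something it states.

```python
from enum import Enum
from typing import List


class Primitive(Enum):
    """The two atomic operations named in the abstract."""
    ERASE = "erase"
    RENDER = "render"


# Assumed mapping from editing task to an ordered list of primitives.
# The abstract does not spell this mapping out; it is a plausible reading.
TASK_PLAN = {
    "removal": [Primitive.ERASE],
    "generation": [Primitive.RENDER],
    "replacement": [Primitive.ERASE, Primitive.RENDER],
}


def plan(task: str) -> List[Primitive]:
    """Return the ordered primitives assumed to make up an editing task."""
    if task not in TASK_PLAN:
        raise ValueError(f"unknown task: {task!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(TASK_PLAN[task])
```

Under this reading, `plan("replacement")` yields erasure followed by rendering, which is how a single model could serve all three tasks by composing the same two operations.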