BITCASMTLab | Mar 30, 2026 | arXiv:2603.28367

Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang

AI Summary

This paper introduces a text-guided image editing framework built on visual autoregressive (VAR) models. It targets two weaknesses of existing VAR-based editors: localizing editable tokens and maintaining structural consistency. The authors propose a coarse-to-fine token localization strategy and a feature injection mechanism based on an analysis of structure-related features in the intermediate representations of VAR models. They also use reinforcement learning to learn scale- and layer-specific injection ratios, and report better structural consistency and editing quality than existing methods.

Key Contribution

Improves structure preservation in text-guided image editing by injecting structure-related features into visual autoregressive models, with scale- and layer-specific injection ratios learned through reinforcement learning.

Abstract

Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that refines editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, on which we base a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
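
The abstract does not spell out how injection is computed, so the following is only a minimal sketch of what scale- and layer-specific feature injection could look like. It assumes that an "injection ratio" acts as a linear blending weight between source-image and edited-image features, which may differ from the paper's actual formulation. The names `inject`, `inject_all`, and the `(scale, layer)` keying are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

Feature = List[float]
Key = Tuple[int, int]  # hypothetical (scale index, layer index)


def inject(edit_feat: Feature, src_feat: Feature, ratio: float) -> Feature:
    """Blend source features into edited features.

    ASSUMPTION: the ratio is a linear blending weight.
    ratio = 0 keeps the edited features; ratio = 1 copies the source features.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(edit_feat) != len(src_feat):
        raise ValueError("feature lengths differ")
    return [(1.0 - ratio) * e + ratio * s for e, s in zip(edit_feat, src_feat)]


def inject_all(
    edit: Dict[Key, Feature],
    src: Dict[Key, Feature],
    ratios: Dict[Key, float],
    default: float = 0.0,
) -> Dict[Key, Feature]:
    """Apply a separate ratio per (scale, layer); missing keys use `default`."""
    return {
        key: inject(feat, src[key], ratios.get(key, default))
        for key, feat in edit.items()
    }


if __name__ == "__main__":
    edit = {(0, 0): [0.0, 2.0], (1, 0): [1.0, 1.0]}
    src = {(0, 0): [4.0, 6.0], (1, 0): [3.0, 3.0]}
    # Hand-picked ratios; in the paper these are learned with reinforcement learning.
    ratios = {(0, 0): 0.5}
    print(inject_all(edit, src, ratios))
```

In this sketch the ratio table is fixed by hand. In the described method, a reinforcement learning scheme would choose these values to trade off editing fidelity against structure preservation.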

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
