This paper formalizes diffusion-based image editing as guided transport on a learned image manifold, unifying diverse editing paradigms under a common theoretical framework. It introduces task-agnostic metrics for evaluating instruction adherence, region preservation, semantic consistency, and edit stability. The authors then derive theoretical bounds connecting guidance strength and inversion error to deviations in non-target regions, characterize accumulation effects under iterative editing, and validate these findings empirically.
Diffusion-based image editing's impressive flexibility comes with fundamental trade-offs between controllability, faithfulness, consistency, locality, and quality, which this paper exposes with clear theoretical bounds.
Diffusion-based editing has rapidly evolved from curated inpainting tools into general-purpose editors spanning text-guided instruction following, mask-localized edits, drag-based geometric manipulation, exemplar transfer, and training-free composition systems. Despite strong empirical progress, the field lacks a unified treatment of core desiderata that govern practical usability: controllability (how precisely and continuously the user can specify an edit), faithfulness to user intent (semantic alignment to instructions), semantic consistency (preservation of identity and non-target content), locality (containment of changes), and perceptual quality (artifact suppression and detail retention). This paper provides a theoretical and empirical analysis of general diffusion-based image editing, connecting diverse paradigms through a common view of editing as guided transport on a learned image manifold. We first formalize editing as an operator induced by a conditional reverse-time generative process and define task-agnostic metrics capturing instruction adherence, region preservation, semantic consistency, and stability under repeated edits. We then develop theory describing edit dynamics under (i) noise-injection and denoising transport, (ii) inversion-and-edit pipelines and the propagation of inversion errors, and (iii) locality constraints implemented via masked guidance or hard constraints. Under mild Lipschitz assumptions on the learned score or flow field, we derive bounds connecting guidance strength and inversion error to measurable deviations in non-target regions, and we characterize accumulation effects under iterative multi-turn editing. Empirically, we benchmark representative paradigms.
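As a rough illustration of what a region-preservation metric and a repeated-edit stability measurement could look like, the sketch below computes the mean squared deviation outside an edit mask and tracks it over several successive edits. This is not the paper's definition: the function names (`region_preservation_error`, `toy_edit`, `iterated_drift`), the choice of mean squared error, and the toy editor that leaks Gaussian noise into non-target pixels are all assumptions made for illustration.

```python
import numpy as np


def region_preservation_error(original, edited, edit_mask):
    """Mean squared deviation over pixels OUTSIDE the edit mask.

    Hypothetical metric: `edit_mask` is a boolean (H, W) array that is True
    where the edit is allowed to change the image.
    """
    keep = ~np.asarray(edit_mask, dtype=bool)
    if not keep.any():
        return 0.0
    diff = (np.asarray(edited, dtype=np.float64)
            - np.asarray(original, dtype=np.float64))[keep]
    return float(np.mean(diff ** 2))


def toy_edit(image, edit_mask, leak=0.01, seed=0):
    """Stand-in for a real editor (assumption, not a diffusion model).

    Overwrites the masked region with a constant and adds small Gaussian
    noise of scale `leak` everywhere, mimicking leakage into non-target
    regions.
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(image, dtype=np.float64) + leak * rng.standard_normal(np.shape(image))
    out[np.asarray(edit_mask, dtype=bool)] = 1.0
    return out


def iterated_drift(image, edit_mask, n_rounds=5, leak=0.01):
    """Apply the toy editor repeatedly and record non-target deviation
    from the ORIGINAL image after each round."""
    errors = []
    current = image
    for t in range(n_rounds):
        current = toy_edit(current, edit_mask, leak=leak, seed=t)
        errors.append(region_preservation_error(image, current, edit_mask))
    return errors


if __name__ == "__main__":
    img = np.zeros((32, 32, 3))
    mask = np.zeros((32, 32), dtype=bool)
    mask[8:16, 8:16] = True  # hypothetical edit region
    for round_idx, err in enumerate(iterated_drift(img, mask), start=1):
        print(f"round {round_idx}: non-target MSE = {err:.2e}")
```

In this toy setting the non-target error grows with the number of rounds because each round adds independent noise, which mirrors, in spirit only, the accumulation under multi-turn editing that the abstract describes. A real evaluation would replace `toy_edit` with an actual editing pipeline and would likely use perceptual or feature-space distances rather than raw pixel error.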