The paper introduces Condition-Aware Routing of Experts (CARE-Edit), a novel approach to contextual image editing using diffusion models that addresses limitations of static conditioning methods like ControlNet. CARE-Edit employs a latent-attention router to dynamically assign diffusion tokens to specialized experts (Text, Mask, Reference, Base) based on multi-modal conditions and diffusion timesteps, enabling task-specific computation. Experiments demonstrate CARE-Edit's superior performance on various editing tasks by mitigating multi-condition conflicts and improving the integration of semantic, spatial, and stylistic information.
Tired of color bleeding and unpredictable results in multi-modal image editing? CARE-Edit dynamically routes diffusion tokens to specialized experts, finally untangling conflicting conditions for cleaner edits.
Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, and so suffer from task interference and poor adaptation to heterogeneous demands (e.g., local vs. global, semantic vs. photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters, which cannot dynamically prioritize or suppress conflicting modalities. This results in artifacts such as color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit), which aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts (Text, Mask, Reference, and Base) based on multi-modal conditions and diffusion timesteps. The pipeline proceeds in three steps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module then fuses the expert outputs, coherently integrating semantic, spatial, and stylistic information into the base image. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of the specialized experts, showing the importance of dynamic, condition-aware processing for mitigating multi-condition conflicts.
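The routing and fusion steps above can be illustrated with a minimal sketch of generic sparse top-K mixture-of-experts routing for a single token. This is not the paper's implementation: the function names (`top_k_route`, `mix_token`), the toy experts, and the hand-picked router logits are all hypothetical. In CARE-Edit the logits would come from the latent-attention router, conditioned on the multi-modal inputs and the diffusion timestep; here they are simply given.

```python
import math

# Expert order is an assumption made for this illustration only.
EXPERTS = ["text", "mask", "reference", "base"]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2):
    """Keep the k highest-scoring experts and renormalize their weights.

    Returns a dict {expert_index: weight}; the weights sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def mix_token(token, logits, experts, k=2):
    """Run only the selected experts and fuse their outputs by weighted sum."""
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    for idx, w in routes.items():
        y = experts[idx](token)  # unselected experts are never evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out, routes

# Toy stand-ins for the four experts (hypothetical; real experts are networks).
toy_experts = [
    lambda t: [2.0 * v for v in t],  # "text"
    lambda t: [0.0 for _ in t],      # "mask"
    lambda t: [v + 1.0 for v in t],  # "reference"
    lambda t: list(t),               # "base"
]

token = [1.0, -1.0]
logits = [2.0, -1.0, 0.5, 2.0]  # hand-picked router scores for this token
mixed, routes = mix_token(token, logits, toy_experts, k=2)
print({EXPERTS[i]: round(w, 3) for i, w in routes.items()}, mixed)
```

With these scores the token is routed only to the "text" and "base" stand-ins, each with weight 0.5, so the "mask" and "reference" experts cost nothing for this token. That is the sense in which sparse top-K selection allocates computation per token rather than applying every condition uniformly.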