The paper addresses physically implausible results in instruction-based image editing by reformulating the task as predicting physical state transitions rather than learning a discrete image-to-image mapping. It introduces PhysicTran38K, a large-scale video dataset of 38K physical transition trajectories, and PhysicEdit, an end-to-end framework that uses a frozen Qwen2.5-VL model for reasoning and learnable transition queries to guide a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing.
Instruction-based image editing can now produce more physically plausible results, thanks to a new method that treats editing as a dynamic physical state transition rather than a static image mapping.
Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.
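To make the idea of learnable transition queries with timestep-adaptive guidance more concrete, here is a minimal NumPy sketch. It is not the PhysicEdit implementation: the abstract does not specify the architecture, so the class name `TransitionQueries`, the scale-and-shift modulation, the single-head cross-attention, and all dimensions are illustrative assumptions. The sketch only shows the general pattern in which a small set of trainable query vectors is modulated by the diffusion timestep and then attends over features from a frozen encoder, so the guidance passed to the backbone changes with the timestep.

```python
import numpy as np


def sinusoidal_embedding(t, dim):
    """Standard sinusoidal embedding of a scalar diffusion timestep."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TransitionQueries:
    """Hypothetical module: trainable queries modulated by the timestep.

    All names and shapes here are assumptions for illustration only.
    """

    def __init__(self, num_queries=4, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        # Parameters that would be learned during training.
        self.queries = rng.normal(scale=0.1, size=(num_queries, dim))
        self.w_scale = rng.normal(scale=0.1, size=(dim, dim))
        self.w_shift = rng.normal(scale=0.1, size=(dim, dim))

    def __call__(self, frozen_features, t):
        # Map the timestep to a per-channel scale and shift.
        emb = sinusoidal_embedding(t, self.dim)
        scale = emb @ self.w_scale
        shift = emb @ self.w_shift
        # Timestep-adaptive queries, shape (num_queries, dim).
        q = self.queries * (1.0 + scale) + shift
        # Single-head cross-attention over the frozen encoder features.
        logits = q @ frozen_features.T / np.sqrt(self.dim)
        attn = softmax(logits, axis=-1)
        # Guidance tokens that a diffusion backbone could condition on.
        return attn @ frozen_features


# Stand-in for features from a frozen vision-language model (random here).
features = np.random.default_rng(1).normal(size=(10, 16))
tq = TransitionQueries()
guidance_early = tq(features, t=900)  # noisy, early denoising step
guidance_late = tq(features, t=10)    # nearly clean, late denoising step
print(guidance_early.shape, guidance_late.shape)
```

In this sketch the encoder features stay fixed while the queries change with `t`, so the two calls return different guidance tokens for the same input. Only the query parameters and the two projection matrices would be trained; the feature extractor is treated as frozen.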