This paper addresses the challenge of understanding image degradations by reformulating it as a hierarchical structured prediction task: estimating degradation types, parameter keys, and continuous physical values. The authors unify these sub-tasks under an autoregressive next-token prediction paradigm and introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning on a new large-scale dataset, DU-110k. The results demonstrate that DU-VLM outperforms generalist baselines and can serve as a zero-shot controller for pre-trained diffusion models in image restoration.
VLMs can be taught to understand the physics of image degradation well enough to control diffusion models for zero-shot image restoration, without fine-tuning the generative backbone.
Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, requiring the joint estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under a single autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, and generalizes to unseen distributions.
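The abstract's central unification step can be sketched concretely. The following is a minimal illustration, not the paper's actual tokenizer: all names, token formats, and the example degradation are hypothetical. It shows how a hierarchy of (degradation type, parameter key, continuous value) can be flattened into one next-token stream by quantizing each continuous value onto a uniform grid, so that the round-trip error is bounded by half the grid spacing, consistent with the claim that the paradigm's error is bounded by the value-space quantization grid.

```python
# Hypothetical sketch: serializing hierarchical degradation annotations into a
# single autoregressive token stream. Token formats and names are illustrative,
# not taken from DU-VLM.

def make_grid(lo: float, hi: float, n_bins: int) -> list[float]:
    """Uniform quantization grid over [lo, hi] with n_bins points."""
    step = (hi - lo) / (n_bins - 1)
    return [lo + i * step for i in range(n_bins)]

def encode(value: float, grid: list[float]) -> int:
    """Map a continuous physical value to its nearest grid token id."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - value))

def decode(token_id: int, grid: list[float]) -> float:
    """Recover the (quantized) physical value from a token id."""
    return grid[token_id]

def serialize(deg_type: str, params: dict[str, float],
              grids: dict[str, list[float]]) -> list[str]:
    """Flatten the (type, key, value) hierarchy into one token sequence."""
    tokens = [f"<type:{deg_type}>"]
    for key, value in params.items():
        tokens.append(f"<key:{key}>")
        tokens.append(f"<val:{encode(value, grids[key])}>")
    return tokens

# Example: a Gaussian-blur degradation with a continuous sigma parameter.
grid = make_grid(0.0, 5.0, 101)          # grid step = 0.05
stream = serialize("gaussian_blur", {"sigma": 2.37}, {"sigma": grid})
sigma_hat = decode(encode(2.37, grid), grid)
# Quantization error is at most half the grid step (0.025 here).
assert abs(sigma_hat - 2.37) <= 0.025 + 1e-9
```

Under this scheme, shrinking the grid step tightens the value-reconstruction error at the cost of a larger value vocabulary, which is the trade-off the stated error bound formalizes.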