This paper addresses the challenge of understanding image degradations by reformulating it as a hierarchical structured prediction task: estimating degradation types, parameter keys, and continuous physical values. The authors unify these sub-tasks under an autoregressive next-token prediction paradigm and introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning on a new large-scale dataset, DU-110k. The results demonstrate that DU-VLM outperforms generalist baselines and can serve as a zero-shot controller for pre-trained diffusion models in image restoration.
VLMs can be taught to understand the physics of image degradation well enough to control diffusion models for zero-shot image restoration, without fine-tuning the generative backbone.
Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, requiring the joint estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under a single autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, and generalizes to unseen distributions.
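The abstract's central unification step can be sketched concretely. The following is a minimal illustration, not the paper's actual tokenizer: all names, token formats, and the example degradation are hypothetical. It shows how a hierarchy of (degradation type, parameter key, continuous value) can be flattened into one next-token stream by quantizing each continuous value onto a uniform grid, so that the round-trip error is bounded by half the grid spacing, consistent with the claim that the paradigm's error is bounded by the value-space quantization grid.

```python
# Hypothetical sketch: serializing hierarchical degradation annotations into a
# single autoregressive token stream. Token formats and names are illustrative,
# not taken from DU-VLM.

def make_grid(lo: float, hi: float, n_bins: int) -> list[float]:
    """Uniform quantization grid over [lo, hi] with n_bins points."""
    step = (hi - lo) / (n_bins - 1)
    return [lo + i * step for i in range(n_bins)]

def encode(value: float, grid: list[float]) -> int:
    """Map a continuous physical value to its nearest grid token id."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - value))

def decode(token_id: int, grid: list[float]) -> float:
    """Recover the (quantized) physical value from a token id."""
    return grid[token_id]

def serialize(deg_type: str, params: dict[str, float],
              grids: dict[str, list[float]]) -> list[str]:
    """Flatten the (type, key, value) hierarchy into one token sequence."""
    tokens = [f"<type:{deg_type}>"]
    for key, value in params.items():
        tokens.append(f"<key:{key}>")
        tokens.append(f"<val:{encode(value, grids[key])}>")
    return tokens

# Example: a Gaussian-blur degradation with a continuous sigma parameter.
grid = make_grid(0.0, 5.0, 101)          # grid step = 0.05
stream = serialize("gaussian_blur", {"sigma": 2.37}, {"sigma": grid})
sigma_hat = decode(encode(2.37, grid), grid)
# Quantization error is at most half the grid step (0.025 here).
assert abs(sigma_hat - 2.37) <= 0.025 + 1e-9
```

Under this scheme, shrinking the grid step tightens the value-reconstruction error at the cost of a larger value vocabulary, which is the trade-off the stated error bound formalizes.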