Jun 6, 2026 | arXiv:2606.08063

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Q. Xiao, Qifeng Chen

AI Summary

This paper introduces Robust-U1, a framework that empowers Multimodal Large Language Models (MLLMs) to autonomously recover corrupted visual content, addressing the significant performance drop these models face under real-world visual corruptions. The approach involves a three-stage process: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards to enhance visual quality, and multimodal reasoning that integrates both the corrupted input and the recovered image. Experimental results show that Robust-U1 not only achieves state-of-the-art robustness against visual corruptions but also improves reasoning performance, highlighting the importance of self-recovery in visual understanding.

Key Contribution

MLLMs can now autonomously recover from visual corruption, significantly boosting their reasoning capabilities in real-world scenarios.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to improve visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
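
The dual reward mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two terms are combined, so the equal weighting here is an assumption, as are the function names. The SSIM below is a simplified single-window variant computed over the whole image rather than the usual sliding-window form, and the semantic term is a cosine similarity over embeddings that are assumed to have been precomputed by a CLIP image encoder.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over flattened pixel intensities (a simplification)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(p * q for p, q in zip(a, b)) / (norm_a * norm_b)


def dual_reward(
    recovered: Sequence[float],
    reference: Sequence[float],
    emb_recovered: Sequence[float],
    emb_reference: Sequence[float],
    w_pixel: float = 0.5,      # assumed weight, not taken from the paper
    w_semantic: float = 0.5,   # assumed weight, not taken from the paper
) -> float:
    """Weighted sum of a pixel-level term and a semantic-level term."""
    pixel_term = global_ssim(recovered, reference)
    semantic_term = cosine_similarity(emb_recovered, emb_reference)
    return w_pixel * pixel_term + w_semantic * semantic_term
```

Under these assumptions, a recovered image identical to its clean reference scores 1.0, and the score falls as either the pixel structure or the embedding drifts away from the reference.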

Computer Vision, Multimodal Models, Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
