The paper analyzes and disentangles risks in Multimodal Large Language Models (MLLMs) by examining step-by-step reasoning over multimodal inputs to improve risk awareness. Building on this risk disentanglement, the authors introduce DREAM, a method that uses supervised fine-tuning and Reinforcement Learning from AI Feedback (RLAIF) to enhance safety alignment in MLLMs. Experiments demonstrate that DREAM improves safety during both inference and training, achieving a 16.17% improvement in the SIUO safe&effective score over GPT-4V without sacrificing performance on normal tasks.
MLLMs can be made significantly safer without sacrificing performance by disentangling risks in multimodal inputs and using RLAIF, outperforming even GPT-4V by 16% on safety benchmarks.
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, which introduces new attack surfaces and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. By leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (i.e., without inducing oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The data and code are available at https://github.com/Kizna1ver/DREAM.
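At a high level, the iterative RLAIF stage mentioned in the abstract can be loosely sketched as a loop that samples candidate responses, scores them with an AI judge, and collects (chosen, rejected) preference pairs for a subsequent preference-optimization update. The sketch below is purely illustrative: `generate_responses` and `ai_feedback_score` are stand-in stubs, not the paper's actual implementation or data pipeline.

```python
# Toy sketch of one iterative RLAIF-style round: sample candidates,
# score them with an (assumed) AI judge, keep best/worst as a pair.
# All names here are illustrative assumptions, not the DREAM codebase.

def generate_responses(model, prompt, n=2):
    # Stand-in for sampling n candidate responses from an MLLM.
    return [f"{model}:{prompt}:cand{i}" for i in range(n)]

def ai_feedback_score(response):
    # Stand-in for an AI judge rating safety/helpfulness; here it
    # trivially prefers the first candidate (suffix "cand0").
    return 1.0 if response.endswith("0") else 0.0

def rlaif_iteration(model, prompts):
    # One round: for each prompt, rank candidates by judge score and
    # record (chosen, rejected) pairs that would drive a preference-
    # optimization update in a real training loop.
    pairs = []
    for p in prompts:
        cands = generate_responses(model, p)
        ranked = sorted(cands, key=ai_feedback_score, reverse=True)
        pairs.append((ranked[0], ranked[-1]))
    return pairs

pairs = rlaif_iteration("mllm", ["q1", "q2"])
```

In a real system the judge would evaluate the disentangled visual and textual risks of each response, and the collected pairs would feed a preference-learning objective; the loop structure, however, stays essentially this simple.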