This paper introduces RedEdit, a black-box red-teaming agent that probes image safety classifiers by combining a Vision-Language Model (VLM), which generates targeted edits, with Monte Carlo Tree Search (MCTS), which optimizes edit sequences. The study finds that fewer than two edits on average allow 76.2% of unsafe images to bypass detection while retaining 93.0% of their malicious semantics, highlighting a significant gap in current content moderation systems. These findings underscore the need for stronger defenses against user-style malicious image editing, which poses a real threat to online safety.
On average, fewer than two edits let 76.2% of unsafe images bypass safety classifiers while retaining 93.0% of their malicious semantics, exposing critical vulnerabilities in current moderation systems.
Image safety classifiers serve as a critical component of contemporary content moderation systems on the internet. However, their resilience against user-style malicious image editing remains underexplored. Such behaviors are highly prevalent in daily scenarios but difficult to fully reproduce. To explore this vulnerability, we introduce RedEdit, a novel black-box red-teaming agent that formulates photo-editing evasion as a combinatorial search problem over edit-tool sequences. It adopts a Vision-Language Model (VLM)-based proposer to generate semantically targeted candidate edits and a Monte Carlo Tree Search (MCTS) planner to prioritize promising edit paths while backtracking from ineffective ones. Together, the proposer and planner instantiate two key capabilities of human attackers, namely domain knowledge and iterative backtracking, to reproduce this practical threat. Our extensive experiments on UnsafeBench reveal profound systemic vulnerabilities: fewer than two edits on average enable 76.2% of unsafe images to evade detectors while retaining 93.0% of their malicious semantics, meaning that the manipulated content remains perceptually malicious to humans yet easily bypasses automated moderation. We therefore call on the community to pay more attention to this overlooked practical threat.
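To make the "combinatorial search over edit-tool sequences" idea concrete, the following is a minimal toy sketch of an MCTS planner paired with a proposer. It is not the authors' implementation. Everything in it is an assumption for illustration: the "image" is just a pair `(unsafe_score, semantics)`, the three edit tools and their multipliers are made up, `propose` is a trivial stand-in for the VLM proposer, the classifier is a fixed threshold on `unsafe_score`, and the reward (retained semantics if the classifier is evaded, otherwise zero) is one plausible choice rather than the paper's objective. For brevity each new node is scored directly instead of through a random rollout.

```python
import math
import random

# Hypothetical edit tools acting on a toy "image" = (unsafe_score, semantics).
# Real tools would transform pixels; these multipliers are invented.
EDITS = {
    "crop": lambda s, m: (s * 0.7, m * 0.97),
    "blur": lambda s, m: (s * 0.5, m * 0.60),
    "tint": lambda s, m: (s * 0.8, m * 0.99),
}

def apply_edits(image, seq):
    s, m = image
    for name in seq:
        s, m = EDITS[name](s, m)
    return s, m

def propose(image, seq):
    """Stand-in for the VLM proposer: offer every tool not yet used."""
    return [name for name in EDITS if name not in seq]

def reward(image, seq, threshold=0.5):
    """Black-box query: retained semantics if the toy classifier is evaded, else 0."""
    s, m = apply_edits(image, seq)
    return m if s < threshold else 0.0

class Node:
    def __init__(self, seq, parent=None):
        self.seq, self.parent = seq, parent
        self.children, self.untried = [], None  # untried is filled lazily
        self.visits, self.value = 0, 0.0

def ucb(parent, child, c):
    """UCB1 score: average reward plus an exploration bonus."""
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def search(image, iterations=200, max_depth=3, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(())
    best_seq, best_reward = (), 0.0
    for _ in range(iterations):
        node = root
        # Selection: descend by UCB1 until a node still has untried edits
        # or is a leaf at maximum depth.
        while True:
            if node.untried is None:
                node.untried = propose(image, node.seq) if len(node.seq) < max_depth else []
            if node.untried or not node.children:
                break
            parent = node
            node = max(parent.children, key=lambda ch: ucb(parent, ch, c))
        # Expansion: try one new candidate edit from the proposer.
        if node.untried:
            edit = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.seq + (edit,), node)
            node.children.append(child)
            node = child
        # Evaluation: score the edit sequence with one classifier query.
        r = reward(image, node.seq)
        if r > best_reward:
            best_seq, best_reward = node.seq, r
        # Backpropagation: low-reward branches lose priority, so later
        # iterations effectively backtrack to more promising paths.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return best_seq, best_reward
```

Calling `search((0.85, 1.0))` on this toy setup returns a two-edit sequence that combines `crop` and `tint`. A single `blur` also evades the toy classifier, but it destroys far more of the semantics, so it earns a lower reward. This mirrors, in miniature, the trade-off the abstract describes between evading the detector and keeping the content recognizably malicious.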