May 21, 2026 | arXiv:2605.22273

Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability

Xiang Chen, Yuxian Dong, Chao Li, Chengyin Hu, Jiaju Han, Fengyu Zhang, Yiwei Wei, Jiahuan Long, Jiujiang Guo

AI Summary

This paper introduces CFGPatch, a novel adversarial attack framework targeting vision-language models (VLMs) in visible-infrared (VIS-IR) scenarios. CFGPatch leverages curved-edge fractal geometry and Fraser-spiral rendering to generate adversarial patches that disrupt both shape and texture interpretation in VIS-IR images. Experiments demonstrate that CFGPatch outperforms standard patch baselines in attack effectiveness and robustness, with strong cross-task transferability to image captioning and visual question answering.

Key Contribution

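A carefully crafted geometric adversarial patch, optimized with expectation over transformation (EOT) so that it survives common image-level transformations, can reliably fool VIS-IR VLMs. Adversarial samples optimized for zero-shot classification also transfer to image captioning and VQA.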

Abstract

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.
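The curved-edge idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the construction, so the code below assumes a Sierpinski-style triangular subdivision and replaces each straight edge with a quadratic Bezier arc. The function names (`quad_bezier`, `curved_edge`, `subdivide`, `curved_fractal`) and the `bulge` parameter are hypothetical and chosen only for illustration. Fraser-spiral rendering and EOT optimization are not covered.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def quad_bezier(p0: Point, ctrl: Point, p1: Point, steps: int = 8) -> List[Point]:
    """Sample a quadratic Bezier curve from p0 to p1 at steps + 1 points."""
    pts = []
    for i in range(steps + 1):
        t = i / steps
        # Bernstein weights for a quadratic curve.
        w0, w1, w2 = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        pts.append((w0 * p0[0] + w1 * ctrl[0] + w2 * p1[0],
                    w0 * p0[1] + w1 * ctrl[1] + w2 * p1[1]))
    return pts


def curved_edge(p0: Point, p1: Point, bulge: float = 0.2, steps: int = 8) -> List[Point]:
    """Replace the straight edge p0 -> p1 with a Bezier arc.

    The control point is the edge midpoint pushed along the edge normal by
    bulge * edge_length (assumed parameterization; bulge = 0 gives a straight edge).
    """
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    mx, my = (p0[0] + p1[0]) / 2, (p0[1] + p1[1]) / 2
    ctrl = (mx - bulge * dy, my + bulge * dx)
    return quad_bezier(p0, ctrl, p1, steps)


def midpoint(a: Point, b: Point) -> Point:
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def subdivide(tri: Triangle, depth: int) -> List[Triangle]:
    """Sierpinski-style subdivision: keep the three corner triangles per level."""
    if depth == 0:
        return [tri]
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return (subdivide((a, ab, ca), depth - 1)
            + subdivide((ab, b, bc), depth - 1)
            + subdivide((ca, bc, c), depth - 1))


def curved_fractal(tri: Triangle, depth: int, bulge: float = 0.2,
                   steps: int = 8) -> List[List[Point]]:
    """Return one closed curved outline (list of points) per leaf triangle."""
    outlines = []
    for a, b, c in subdivide(tri, depth):
        # Drop each arc's last point; it is the first point of the next arc.
        outline = (curved_edge(a, b, bulge, steps)[:-1]
                   + curved_edge(b, c, bulge, steps)[:-1]
                   + curved_edge(c, a, bulge, steps)[:-1])
        outlines.append(outline)
    return outlines


if __name__ == "__main__":
    base = ((0.0, 0.0), (1.0, 0.0), (0.5, 1.0))
    shapes = curved_fractal(base, depth=2)
    print(len(shapes), "curved triangles,", len(shapes[0]), "points each")
```

Under these assumptions, a subdivision of depth d yields 3^d leaf triangles, so the self-similar layout is unchanged and only the edge primitive differs. In a patch attack, quantities such as `bulge`, the depth, and the patch placement would be the kind of geometric parameters an optimizer could search over.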

Computer Vision | Multimodal Models | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
