Jun 9, 2026arXiv:2606.10571

Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu

AI Summary

This paper introduces DeBias-Attack, a novel approach to enhance the transferability of adversarial examples in Vision-Language Pre-training (VLP) models by addressing surrogate-specific bias in adversarial optimization. By maintaining two perturbation branches—one optimizing on the original image and the other on a weak-semantic image—the method effectively corrects the bias that arises from surrogate model dependencies. Experimental results demonstrate that DeBias-Attack significantly improves adversarial transferability across various VLP models and tasks, showcasing its robustness against both open-source and closed-source multimodal large language models.

Key Contribution

Correcting surrogate-specific bias in adversarial optimization can dramatically enhance the transferability of attacks across Vision-Language Pre-training models.

Abstract

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.

Multimodal Models Red-Teaming & Adversarial Robustness

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

Related Papers