NVIDIA | University of Southern | Jan 18, 2026 | arXiv:2601.12263

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu

AI Summary

This paper introduces Multimodal Generative Engine Optimization (MGEO), a novel adversarial attack framework that exploits vulnerabilities in VLM-based product search ranking systems. MGEO jointly optimizes imperceptible image perturbations and fluent textual suffixes to unfairly promote a target product, leveraging the cross-modal coupling within VLMs. Experiments on real-world datasets demonstrate that MGEO significantly outperforms unimodal attacks, highlighting the vulnerability of VLMs to coordinated multimodal manipulation.

Key Contribution

VLMs, typically praised for their multimodal synergy, can be easily weaponized to manipulate search rankings via imperceptible image perturbations and fluent textual suffixes, outperforming unimodal attacks.

Abstract

Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.
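The alternating strategy described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a hypothetical bilinear scorer `image^T W text` in place of a real VLM ranker, random embeddings, an L-infinity budget `eps` for the image perturbation, and a greedy token choice for the suffix with no fluency constraint. All names and parameter values (`alternating_attack`, `eps`, `alpha`, `rounds`) are illustrative assumptions. It only shows how each modality's update depends on the current state of the other.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(x):
    return (x > 0) - (x < 0)

def text_vector(base_text, suffix, vocab):
    """Base text embedding plus the embeddings of the suffix tokens."""
    vec = list(base_text)
    for tok in suffix:
        vec = [v + e for v, e in zip(vec, vocab[tok])]
    return vec

def score(W, image, text):
    """Toy bilinear relevance score (hypothetical stand-in for a VLM ranker)."""
    return sum(image[i] * dot(W[i], text) for i in range(len(image)))

def alternating_attack(W, image, base_text, vocab, suffix_len=3,
                       eps=0.03, alpha=0.01, rounds=5):
    delta = [0.0] * len(image)   # image perturbation, kept within [-eps, eps]
    suffix = [0] * suffix_len    # suffix token ids, initialised to token 0
    for _ in range(rounds):
        # Image step: signed-gradient ascent with the text held fixed,
        # then projection back onto the L-infinity ball of radius eps.
        text = text_vector(base_text, suffix, vocab)
        grad_img = [dot(W[i], text) for i in range(len(image))]
        delta = [max(-eps, min(eps, d + alpha * sign(g)))
                 for d, g in zip(delta, grad_img)]
        # Text step: with the perturbed image held fixed, pick for each suffix
        # position the token whose embedding best aligns with the gradient.
        # The toy score is linear in the text vector, so every position
        # selects the same token; a real model would not behave this way.
        adv = [x + d for x, d in zip(image, delta)]
        grad_txt = [sum(adv[i] * W[i][j] for i in range(len(adv)))
                    for j in range(len(base_text))]
        suffix = [max(range(len(vocab)), key=lambda t: dot(vocab[t], grad_txt))
                  for _ in suffix]
    return delta, suffix

# Synthetic data: all dimensions and values are arbitrary.
random.seed(0)
D_IMG, D_TXT, VOCAB = 8, 4, 10
W = [[random.gauss(0, 1) for _ in range(D_TXT)] for _ in range(D_IMG)]
image = [random.random() for _ in range(D_IMG)]
base_text = [random.gauss(0, 1) for _ in range(D_TXT)]
vocab = [[random.gauss(0, 1) for _ in range(D_TXT)] for _ in range(VOCAB)]

delta, suffix = alternating_attack(W, image, base_text, vocab)
clean = score(W, image, text_vector(base_text, [0] * 3, vocab))
adv_image = [x + d for x, d in zip(image, delta)]
attacked = score(W, adv_image, text_vector(base_text, suffix, vocab))
print("clean score:", clean, "attacked score:", attacked)
```

In this toy setting each step cannot lower the score, because the score is linear in each modality once the other is fixed. A real VLM is nonlinear, so the gradients there are only local approximations and no such guarantee holds.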

Interpretability & Mechanistic Interp | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 12

Year: 2026

Venue: N/A
