This paper introduces Multimodal Generative Engine Optimization (MGEO), a novel adversarial attack framework that exploits vulnerabilities in product search ranking systems built on vision-language models (VLMs). MGEO jointly optimizes imperceptible image perturbations and fluent textual suffixes to unfairly promote a target product, leveraging the cross-modal coupling within VLMs. Experiments on real-world datasets demonstrate that MGEO significantly outperforms unimodal attacks, highlighting the vulnerability of VLMs to coordinated multimodal manipulation.
VLMs, typically praised for their multimodal synergy, can be weaponized to manipulate search rankings via imperceptible image perturbations and fluent textual suffixes; the coordinated attack outperforms text-only and image-only baselines.
Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.
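The alternating strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The toy bilinear scorer, the names `W_img`, `E_tok`, `score`, and `alternating_attack`, and all constants are hypothetical stand-ins chosen for illustration. The sketch assumes a sign-gradient step under an L-infinity budget for the image and a greedy one-token swap for the suffix, and it does not model fluency or ranking against competing products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and attack budget (all values are illustrative assumptions).
D_IMG, D_EMB, VOCAB, SUFFIX_LEN = 16, 8, 20, 3
EPS, STEP, ROUNDS = 0.05, 0.01, 10

# Hypothetical stand-in for a VLM ranker: a linear image encoder,
# a token embedding table, and a fixed query embedding.
W_img = rng.normal(size=(D_EMB, D_IMG))
E_tok = rng.normal(size=(VOCAB, D_EMB))
query = rng.normal(size=D_EMB)


def score(image, suffix):
    """Relevance score; the product term couples the two modalities."""
    img_feat = W_img @ image
    txt_feat = E_tok[suffix].mean(axis=0)
    return float(query @ (img_feat + txt_feat) + img_feat @ txt_feat)


def alternating_attack(image, suffix):
    """Alternate one image step and one suffix-token swap per round."""
    delta = np.zeros_like(image)
    suffix = suffix.copy()
    for r in range(ROUNDS):
        # Image step: sign-gradient ascent, projected onto the EPS ball
        # and onto the valid pixel range [0, 1].
        txt_feat = E_tok[suffix].mean(axis=0)
        grad_img = W_img.T @ (query + txt_feat)
        delta = np.clip(delta + STEP * np.sign(grad_img), -EPS, EPS)
        delta = np.clip(image + delta, 0.0, 1.0) - image

        # Text step: the gradient w.r.t. the text feature depends on the
        # perturbed image; swap one suffix position to the best-scoring token.
        img_feat = W_img @ (image + delta)
        gains = E_tok @ (query + img_feat)
        suffix[r % len(suffix)] = int(np.argmax(gains))
    return delta, suffix


image = rng.uniform(0.2, 0.8, size=D_IMG)
suffix = rng.integers(0, VOCAB, size=SUFFIX_LEN)

before = score(image, suffix)
delta, adv_suffix = alternating_attack(image, suffix)
after = score(image + delta, adv_suffix)
print(f"score before: {before:.3f}, after: {after:.3f}")
```

Because the toy score contains a product of image and text features, each modality's gradient depends on the current state of the other, which is the coupling that an alternating scheme can exploit. A real attack would replace the toy scorer with a differentiable VLM, add a fluency constraint on the suffix, and optimize the target's position relative to competing products.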