Apr 5, 2026 | arXiv:2604.03995

A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

AI Summary

This paper introduces Multi-Modal Typography, an attack strategy that uses coordinated typographic perturbations across audio, visual, and text modalities to mislead multi-modal large language models (MLLMs). The authors find that coordinated cross-modal attacks are significantly more effective than single-modality attacks, reaching an 83.43% attack success rate compared to 34.93%. The study exposes this vulnerability across multiple frontier MLLMs, tasks, and benchmarks covering common-sense reasoning and content moderation.

Key Contribution

Coordinated typographic attacks across modalities can more than double the success rate of misleading audio-visual MLLMs compared to single-modality attacks.
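
As a quick arithmetic check of the "more than double" claim, the sketch below uses only the two attack success rates quoted on this page. The variable names are illustrative and do not come from the paper.

```python
# Attack success rates (percent) as quoted in the abstract.
multi_modal_asr = 83.43   # coordinated multi-modal attack
single_modal_asr = 34.93  # single-modality attack

# Relative improvement: how many times higher the coordinated rate is.
ratio = multi_modal_asr / single_modal_asr

# Absolute improvement in percentage points.
gap_points = multi_modal_asr - single_modal_asr

print(f"ratio: {ratio:.2f}x")
print(f"gap: {gap_points:.2f} percentage points")
```

The ratio is above 2, which is consistent with the "more than double" wording.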

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.

Multimodal Models | Red-Teaming & Adversarial Robustness | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
