USTC · Apr 9, 2026 · arXiv:2604.08405

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

Wenli Zhang, Xianglong Shi, Xian Shi, Sirui Zhao, Xinqi Chen, Guo Cheng, Yifan Xu, Tong Xu, Yongxian Liao, Yong Liao

AI Summary

This paper introduces SyncBreaker, a multimodal adversarial attack framework targeting diffusion-based audio-driven talking-head generation. SyncBreaker perturbs both the portrait and the audio input. For the image stream, nullifying supervision with Multi-Interval Sampling (MIS) steers generation toward the static reference portrait. For the audio stream, Cross-Attention Fooling (CAF) suppresses audio-conditioned cross-attention responses. Experiments show that SyncBreaker degrades lip synchronization and facial dynamics more effectively than single-modality baselines, while preserving input perceptual quality and remaining robust under purification.

Key Contribution

Existing protection methods against talking-head misuse are largely limited to a single modality and cannot effectively suppress speech-driven facial dynamics. SyncBreaker degrades lip synchronization and facial dynamics by jointly perturbing the portrait and audio inputs.

Abstract

Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
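The image-stream idea in the abstract (aggregating nullifying guidance from several denoising intervals under a perceptual budget) can be illustrated with a small sketch. This is not the authors' implementation: the linear `toy_denoiser`, the `motion` vector standing in for audio-driven movement, the L-infinity budget `eps`, the step size `alpha`, and the use of signed-gradient projected descent are all assumptions chosen so the example runs with NumPy alone. A real attack would backpropagate through the diffusion model instead.

```python
import numpy as np

def sample_multi_interval(num_steps, num_intervals, rng):
    """Draw one timestep uniformly from each of num_intervals equal-width intervals."""
    edges = np.linspace(0, num_steps, num_intervals + 1).astype(int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(edges[:-1], edges[1:])]

def toy_denoiser(x, t, num_steps, motion):
    """Hypothetical linear surrogate for the generator: prediction = c_t * x + b_t.
    b_t stands in for audio-driven motion added to the portrait."""
    frac = t / num_steps
    c_t = 1.0 - 0.1 * frac
    b_t = (0.5 + 0.5 * frac) * motion
    return c_t * x + b_t, c_t

def nullifying_loss_and_grad(x_adv, x_ref, t, num_steps, motion):
    """Mean squared distance between the prediction and the static reference,
    plus its analytic gradient with respect to x_adv."""
    pred, c_t = toy_denoiser(x_adv, t, num_steps, motion)
    residual = pred - x_ref
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * c_t * residual / residual.size
    return loss, grad

def protect(x, motion, eps=8 / 255, alpha=2 / 255, iters=20,
            num_steps=1000, num_intervals=4, seed=0):
    """Signed-gradient descent on the nullifying loss, projected onto an
    L-infinity ball of radius eps (an assumed perceptual constraint)."""
    rng = np.random.default_rng(seed)
    delta = np.zeros_like(x)
    for _ in range(iters):
        timesteps = sample_multi_interval(num_steps, num_intervals, rng)
        grads = [nullifying_loss_and_grad(x + delta, x, t, num_steps, motion)[1]
                 for t in timesteps]
        grad = np.mean(grads, axis=0)          # aggregate guidance across intervals
        delta = np.clip(delta - alpha * np.sign(grad), -eps, eps)
        delta = np.clip(x + delta, 0.0, 1.0) - x  # keep pixel values in [0, 1]
    return delta

def mean_loss(x_adv, x_ref, motion, num_steps, timesteps):
    """Average nullifying loss over a fixed set of timesteps."""
    return float(np.mean([nullifying_loss_and_grad(x_adv, x_ref, t, num_steps, motion)[0]
                          for t in timesteps]))

# Toy usage: a 16-value "portrait" and an alternating-sign motion pattern.
EPS = 8 / 255
x = np.linspace(0.2, 0.8, 16)
motion = np.where(np.arange(16) % 2 == 0, 0.3, -0.3)
delta = protect(x, motion, eps=EPS)
eval_steps = range(0, 1000, 50)
loss_before = mean_loss(x, x, motion, 1000, eval_steps)
loss_after = mean_loss(x + delta, x, motion, 1000, eval_steps)
print("max |delta|:", float(np.max(np.abs(delta))))
print("nullifying loss before/after:", loss_before, loss_after)
```

Sampling one timestep per interval, rather than all timesteps from a single range, is what lets each update see early, middle, and late denoising stages at once. In this toy setting the perturbed portrait pulls the surrogate's output closer to the static reference while staying inside the budget.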

Multimodal Models · Red-Teaming & Adversarial Robustness · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
