This paper introduces SyncBreaker, a multimodal adversarial attack framework targeting diffusion-based audio-driven talking-head generation. SyncBreaker jointly perturbs the portrait and audio inputs: Multi-Interval Sampling (MIS) perturbs the image stream to steer generation toward the static reference portrait, and Cross-Attention Fooling (CAF) perturbs the audio stream to suppress audio-conditioned cross-attention. Experiments show that SyncBreaker degrades lip synchronization and facial dynamics more effectively than unimodal attacks, while preserving the perceptual quality of the inputs and remaining robust against purification.
Existing single-modality protections against talking-head generation cannot suppress speech-driven facial dynamics: SyncBreaker degrades lip sync and facial dynamics by jointly perturbing the portrait image and the audio.
Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
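The two-stream idea described above (one perturbation per modality, each with its own budget, optimized separately and paired only at inference) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the optimizer or the constraint norm, so projected signed-gradient descent under an L-infinity budget is an assumption here. The linear "denoising intervals", the scalar "attention response", and every name and constant in the code are hypothetical stand-ins for a real diffusion model.

```python
# Toy sketch: every function and constant below is hypothetical, not SyncBreaker's code.

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def pgd_minimize(x, grad_fn, eps, alpha, steps):
    """Projected signed-gradient descent: find delta with |delta_i| <= eps that lowers a loss."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        grad = grad_fn([xi + di for xi, di in zip(x, delta)])
        # Step against the gradient sign, then project back into the L-infinity ball.
        delta = [max(-eps, min(eps, di - alpha * sign(gi))) for di, gi in zip(delta, grad)]
    return delta

# Image stream: toy "denoising intervals" y_t = a_t * x' + b_t, where b_t stands in
# for motion injected at that interval. The loss aggregates over all intervals.
INTERVALS = [(1.0, 0.05), (0.9, 0.10), (1.1, -0.02)]

def image_loss(x_adv, x_ref):
    """Summed squared distance between each toy interval output and the static reference."""
    return sum((a * xa + b - xr) ** 2 for a, b in INTERVALS for xa, xr in zip(x_adv, x_ref))

def image_grad(x_adv, x_ref):
    """Analytic gradient of image_loss with respect to x_adv."""
    return [sum(2 * a * (a * xa + b - xr) for a, b in INTERVALS) for xa, xr in zip(x_adv, x_ref)]

# Audio stream: toy attention response r = sum(k_i * a'_i); the loss r**2 pushes it toward zero.
KEYS = [0.4, -0.2, 0.7]

def audio_loss(a_adv):
    return sum(k * a for k, a in zip(KEYS, a_adv)) ** 2

def audio_grad(a_adv):
    r = sum(k * a for k, a in zip(KEYS, a_adv))
    return [2 * r * k for k in KEYS]

portrait = [0.2, 0.5, 0.8]   # stand-in for pixel values
audio = [0.3, 0.1, 0.6]      # stand-in for audio features

# Each stream is optimized on its own, with its own budget.
delta_img = pgd_minimize(portrait, lambda xa: image_grad(xa, portrait), eps=0.03, alpha=0.005, steps=20)
delta_aud = pgd_minimize(audio, audio_grad, eps=0.01, alpha=0.002, steps=20)

# The two protected inputs are only paired when they are fed to the generator.
protected_portrait = [p + d for p, d in zip(portrait, delta_img)]
protected_audio = [a + d for a, d in zip(audio, delta_aud)]
```

Because neither optimization reads the other stream's perturbation, either protected input can be deployed alone or together with the other, which is the flexibility the abstract attributes to independent optimization.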