Fondazione Bruno Kessler | Mar 11, 2026 | arXiv:2603.16924

SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation

Amirbek Djanibekov, L. Bentivogli, Matteo Negri, Sara Papi

AI Summary

The paper introduces SimulU, a novel training-free policy for long-form simultaneous speech-to-speech translation (SimulS2S). SimulU leverages cross-attention within pre-trained end-to-end models to manage input history and control output generation, enabling translation of continuous speech without task-specific training. Experiments on MuST-C across 8 languages demonstrate that SimulU achieves competitive or superior quality-latency trade-offs compared to strong cascaded baselines.

Key Contribution

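Skip the training: SimulU is a training-free policy that exploits cross-attention in pre-trained end-to-end models to perform long-form simultaneous speech-to-speech translation, reaching a better or comparable quality-latency trade-off than strong cascaded baselines on MuST-C across 8 languages.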

Abstract

Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.
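
To make the idea of an attention-driven policy concrete, here is a minimal Python sketch of how cross-attention weights could regulate both output selection and input history. It is not the authors' algorithm, which the abstract does not specify. It assumes a simple heuristic: an output token is held back if its most-attended input frame lies among the most recent frames, and audio before the last emitted token's peak is dropped. The function names, the `guard_frames` parameter, and the roughly monotonic alignment are all assumptions made for illustration.

```python
from typing import Sequence


def peak_frame(row: Sequence[float]) -> int:
    """Index of the input frame with the highest attention weight."""
    return max(range(len(row)), key=lambda j: row[j])


def select_stable_prefix(attn: Sequence[Sequence[float]], guard_frames: int) -> int:
    """Return how many leading output tokens can be emitted now.

    attn[i][j] is the cross-attention weight of output token i on input frame j.
    A token is held back when its peak falls within the last `guard_frames`
    frames, because that audio may still be incomplete (assumed heuristic).
    """
    if not attn:
        return 0
    limit = len(attn[0]) - guard_frames
    emitted = 0
    for row in attn:
        if peak_frame(row) >= limit:
            break  # this token and all later ones wait for more audio
        emitted += 1
    return emitted


def first_frame_to_keep(attn: Sequence[Sequence[float]], n_emitted: int) -> int:
    """Return the first input frame to retain in the history.

    Frames before the peak of the last emitted token are treated as consumed.
    This assumes a roughly monotonic source-target alignment.
    """
    if n_emitted == 0:
        return 0
    return peak_frame(attn[n_emitted - 1])


# Toy attention matrix: 4 output tokens over 6 input frames.
attn = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.6, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.6, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.2, 0.6],
]
n = select_stable_prefix(attn, guard_frames=2)
cut = first_frame_to_keep(attn, n)
print(n, cut)
```

In this toy case the first three tokens are emitted and the fourth waits, since its attention peaks on the newest frame. A larger `guard_frames` value would trade higher latency for more stable output.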

Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
