The paper introduces Hikari, an end-to-end model for simultaneous speech-to-text translation and streaming transcription that avoids human-engineered heuristics by encoding READ/WRITE decisions in a probabilistic WAIT token mechanism. Hikari also incorporates Decoder Time Dilation to reduce autoregressive overhead, along with a supervised fine-tuning strategy that improves recovery from delays. Experiments on English-to-Japanese, English-to-German, and English-to-Russian translation show that Hikari achieves state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming existing baselines.
Ditch the heuristics: Hikari achieves state-of-the-art simultaneous speech translation by learning READ/WRITE decisions directly through a probabilistic WAIT token.
Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or separately learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions in a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, English-to-German, and English-to-Russian translation, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
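To make the WAIT-token idea concrete, here is a minimal sketch of a policy-free decoding loop in which the decoder itself issues READ/WRITE decisions: emitting the special WAIT token consumes one more speech chunk (READ), while any other token is appended to the output (WRITE). The `predict_next` interface, the chunking, and the toy predictor are illustrative assumptions for exposition only, not Hikari's actual architecture or API.

```python
# Hypothetical sketch of WAIT-token simultaneous decoding (not Hikari's real API).
WAIT = "<WAIT>"
EOS = "<EOS>"

def simultaneous_decode(predict_next, speech_chunks, max_steps=100):
    """Greedy simultaneous decoding driven entirely by the model's own tokens.

    predict_next(read, written, source_done) -> next token (may be WAIT/EOS).
    WAIT  => READ decision: consume one more source chunk.
    other => WRITE decision: emit the token.
    """
    chunks = iter(speech_chunks)
    read = [next(chunks)]      # always read at least one chunk before decoding
    written = []
    source_done = False
    for _ in range(max_steps):
        tok = predict_next(read, written, source_done)
        if tok == EOS:
            break
        if tok == WAIT:
            try:
                read.append(next(chunks))
            except StopIteration:
                source_done = True     # no more source; model must finish writing
        else:
            written.append(tok)
    return written

def toy_predictor(read, written, source_done):
    """Stand-in for the model: waits until it is two chunks ahead of the
    output, then 'translates' the oldest untranslated chunk (uppercasing)."""
    if source_done and len(written) == len(read):
        return EOS
    if not source_done and len(read) - len(written) < 2:
        return WAIT
    return read[len(written)].upper()

print(simultaneous_decode(toy_predictor, ["a", "b", "c"]))  # ['A', 'B', 'C']
```

Because the READ/WRITE choice is just another token probability, no external heuristic (e.g. a fixed wait-k schedule) is needed at inference time, which is the policy-free property the abstract emphasizes.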