CMU ML, AIST, Keio, NAIST, TU Munich, UTokyo

Jun 11, 2026

arXiv:2606.13322

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki

AI Summary

This paper introduces a low-latency real-time audio commentary system that generates spoken commentary directly from live gameplay video, running text generation in parallel with speech playback. By removing the waiting time that accumulates in sequential pipelines, the system reduces mean inter-utterance silence from 9.6 seconds to 0.3 seconds. Its speaking-silence timing patterns are also over 40% more similar to those of professional commentators, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm.

Key Contribution

Running text generation in parallel with speech playback, with candidate utterances buffered ahead of time, reduces mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines.

Abstract

We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking-silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.
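The difference between the two scheduling strategies described in the abstract can be illustrated with a small timing model. This is only a sketch and not the authors' implementation: the `Timing` class, both function names, and the latency values in the usage note are hypothetical. It assumes that a background generator produces texts back to back without pausing, and it ignores frame capture, buffer size limits, and how a candidate is chosen from the buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """Per-utterance latencies in seconds (illustrative values only)."""
    generate: float    # time for the LLM to produce the text
    synthesize: float  # time for TTS to produce the audio
    playback: float    # duration of the spoken utterance


def sequential_silences(utterances):
    """Silence before each utterance after the first in a sequential pipeline.

    The next generation is requested only after playback has finished,
    so every gap is the full generation time plus the synthesis time.
    """
    return [u.generate + u.synthesize for u in utterances[1:]]


def parallel_silences(utterances):
    """Silence before each utterance after the first when generation
    runs in the background during playback (assumed model).

    Texts are generated back to back starting at t = 0. At a playback
    boundary, synthesis starts as soon as the next text is available.
    """
    silences = []
    text_ready = 0.0     # time at which the current text is finished
    playback_end = None  # time at which the previous utterance stops
    for u in utterances:
        text_ready += u.generate
        if playback_end is None:
            start = text_ready + u.synthesize
        else:
            # Wait for whichever comes later: the text or the boundary.
            start = max(text_ready, playback_end) + u.synthesize
            silences.append(start - playback_end)
        playback_end = start + u.playback
    return silences
```

With four identical utterances of `Timing(generate=2.0, synthesize=0.25, playback=4.0)`, the sequential model leaves a gap of 2.25 seconds before every utterance after the first, while the parallel model leaves only the 0.25 seconds of synthesis time. In this model the benefit shrinks when generation is slower than playback, because the buffer runs empty and the boundary has to wait for the text.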

Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 21

Year: 2026

Venue: N/A
