This paper models overlapped speech with shuffle products and partial order finite-state automata (FSAs) for alignment and speaker-attributed transcription. Training uses the total score of these FSAs as the loss, marginalizing over the possible serializations of overlapping sequences, and temporal constraints keep the graphs small. Experiments on synthetic LibriSpeech overlaps show single-pass alignment of multi-talker recordings, which the authors describe as, to their knowledge, the first algorithm to do so.
Single-pass alignment of multi-talker speech, reported by the authors as a first to their knowledge, by modeling overlaps as shuffles.
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
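To make the idea of "all possible serializations" concrete, the following minimal Python sketch enumerates the shuffle product of two short streams of (token, speaker) tuples. It is an illustration only and not the paper's k2 / Icefall implementation: the function names, the example tokens, and the form of the precedence constraint are all assumptions made here. The paper represents these serializations compactly as an FSA rather than listing them.

```python
from itertools import combinations
from math import comb


def shuffle_product(a, b):
    """Yield every interleaving of a and b that keeps each sequence's own order."""
    n, m = len(a), len(b)
    # Choose which of the n + m output positions are filled from `a`.
    for positions in combinations(range(n + m), n):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield tuple(next(ia) if i in slots else next(ib) for i in range(n + m))


def respects(serialization, before, after):
    """Hypothetical temporal constraint: `before` must appear earlier than `after`."""
    return serialization.index(before) < serialization.index(after)


# Hypothetical two-speaker overlap, modeled as (token, speaker) tuples.
spk1 = [("hello", 1), ("there", 1)]
spk2 = [("good", 2), ("morning", 2)]

serializations = list(shuffle_product(spk1, spk2))
print(len(serializations), comb(4, 2))  # 6 6

# Assume we know speaker 1's "hello" precedes speaker 2's "good".
constrained = [s for s in serializations if respects(s, ("hello", 1), ("good", 2))]
print(len(constrained))  # 3
```

Two sequences of lengths n and m have C(n + m, n) interleavings, so the unconstrained set grows quickly with utterance length. The second print shows how a single precedence constraint, used here as a stand-in for the temporal constraints mentioned above, halves the set in this toy case.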