The paper introduces Trans-Chunk BiMamba (TC-BiMamba), a novel architecture for unified streaming and non-streaming automatic speech recognition (ASR) that addresses a limitation of existing BiMamba-based streaming methods, which are restricted to fixed chunk sizes. TC-BiMamba employs a trans-chunk mechanism to train both directions of the sequence offline with dynamic chunk sizes, enabling a single model to handle both offline and streaming decoding under varying latency requirements. Experiments demonstrate that TC-BiMamba achieves a 1.3x training speedup, reduces memory consumption by 50%, and improves ASR performance compared to chunk-wise processing, while also outperforming U2++ and matching LC-BiMamba with a smaller model size.
Ditch fixed chunk sizes: TC-BiMamba unlocks faster, more memory-efficient training for bidirectional Mamba ASR models, enabling unified streaming and non-streaming decoding.
This work investigates bidirectional Mamba (BiMamba) for unified streaming and non-streaming automatic speech recognition (ASR). Dynamic chunk size training enables a single model to perform offline decoding and streaming decoding under various latency settings. In contrast, the existing BiMamba-based streaming method is limited to fixed chunk size decoding, and when dynamic chunk size training is applied, training overhead increases substantially. To tackle this issue, we propose Trans-Chunk BiMamba (TC-BiMamba) for dynamic chunk size training. The trans-chunk mechanism trains both directions of the sequence in an offline style with dynamic chunk sizes. On the one hand, compared to traditional chunk-wise processing, TC-BiMamba simultaneously achieves a 1.3x training speedup, reduces training memory by 50%, and improves model performance, since it can capture bidirectional context. On the other hand, experimental results show that TC-BiMamba outperforms U2++ and matches LC-BiMamba with a smaller model size.
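To make the dynamic chunk size training setup concrete, the sketch below illustrates the general recipe popularized by U2-style unified models: each training batch samples a chunk size (occasionally the full utterance, for offline behavior) and builds a block-wise mask that limits each frame to its own and earlier chunks. This is a minimal illustration of the training regime the abstract refers to, not the authors' TC-BiMamba implementation; the function names `sample_chunk_size` and `make_chunk_mask` are hypothetical.

```python
# Minimal sketch of dynamic chunk size training (assumed U2-style recipe,
# not the TC-BiMamba code itself).
import random
import torch


def sample_chunk_size(max_len: int, full_prob: float = 0.5) -> int:
    """Sample a chunk size for one batch.

    With probability `full_prob` the whole utterance is one chunk
    (non-streaming mode); otherwise a small chunk size is drawn so the
    same model also learns streaming behavior.
    """
    if random.random() < full_prob:
        return max_len                  # full-context (offline) step
    return random.randint(1, 25)        # streaming-style chunk, in frames


def make_chunk_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask [seq_len, seq_len]: frame i may see frame j iff j lies
    in the same chunk as i or in an earlier chunk (left context only)."""
    chunk_ids = torch.arange(seq_len) // chunk_size
    return chunk_ids.unsqueeze(1) >= chunk_ids.unsqueeze(0)


if __name__ == "__main__":
    T = 12
    c = sample_chunk_size(T)
    mask = make_chunk_mask(T, c)
    print(f"chunk size {c}, mask shape {tuple(mask.shape)}")
```

Because the sampled chunk size changes from batch to batch, a single model learns to decode with full context (offline) and with small chunks (streaming) at different latency settings; the paper's contribution is making this regime efficient for bidirectional Mamba, where naive chunk-wise processing is slow and memory-hungry.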