Speech-Omni-Lite is a cost-efficient framework for endowing pre-trained Visual-Language (VL) models with speech understanding and generation. It adds two lightweight, trainable modules, a speech projector and a speech token generator, while keeping the VL backbone frozen. To address the lack of spoken QA data, the authors propose a data construction strategy that generates Question-Text Answer-Text-Speech (QTATS) data from ASR speech-text pairs. Experiments show that Speech-Omni-Lite achieves spoken QA performance comparable to that of omni-models trained on orders of magnitude more speech data, and that the learned speech modules transfer well across VL backbones.
Reach spoken QA performance comparable to that of large omni-models by adding lightweight speech modules to a frozen VL model and training on QA data constructed from existing ASR speech-text pairs, sidestepping the need for massive multimodal datasets.
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained Visual-Language (VL) backbones with speech understanding and generation capabilities, while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is equipped with two lightweight, trainable plug-and-play modules, a speech projector and a speech token generator, while keeping the VL backbone fully frozen. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed to generate Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective speech generation training. Experimental results show that, even with only thousands of hours of speech training data, Speech-Omni-Lite achieves excellent spoken QA performance, which is comparable to omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules exhibit strong transferability across VL backbones.
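The abstract names the QTATS format but does not spell out the construction procedure, so the following is only an illustrative sketch of one plausible reading: each ASR transcript is reused as the answer text, its original recording is reused as the answer speech, and a text question is written so that the transcript answers it. The names `ASRPair`, `QTATSExample`, `build_qtats`, `template_question`, and the `min_words` filter are all hypothetical and do not come from the paper; in practice the question writer would presumably be a text-only language model rather than the toy template used here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ASRPair:
    """One existing ASR example: an audio clip and its transcript."""
    audio_path: str
    transcript: str


@dataclass(frozen=True)
class QTATSExample:
    """A question in text, with an answer available as both text and speech."""
    question_text: str
    answer_text: str
    answer_audio_path: str


def build_qtats(
    pairs: Iterable[ASRPair],
    write_question: Callable[[str], str],
    min_words: int = 3,  # assumption: skip transcripts too short to be an answer
) -> List[QTATSExample]:
    """Turn ASR pairs into QTATS-style examples without recording new speech."""
    examples: List[QTATSExample] = []
    for pair in pairs:
        answer = pair.transcript.strip()
        if len(answer.split()) < min_words:
            continue
        # Only the question is newly generated; the answer text and answer
        # speech are taken directly from the ASR pair.
        question = write_question(answer).strip()
        if not question:
            continue
        examples.append(QTATSExample(question, answer, pair.audio_path))
    return examples


def template_question(answer_text: str) -> str:
    """Toy stand-in for a language-model call (assumption, not the paper's method)."""
    first_words = " ".join(answer_text.split()[:3])
    return f"Which statement begins with the words '{first_words}'?"
```

Because the speech side of every example is an existing recording, the only added cost in this reading is generating question text, which is consistent with the abstract's description of the strategy as low-cost.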