The paper introduces AuRA, a method that distills audio encoding capability directly into large language models (LLMs). The same speech input is fed to an ASR encoder (the teacher) and, through a lightweight audio embedding layer, to a LoRA-adapted LLM (the student). Layer-wise distillation then aligns the student's hidden states with the corresponding teacher representations. This avoids the transcript-interface latency of cascaded ASR-LLM pipelines and the cost of large-scale multimodal training, while allowing parallel end-to-end inference and tighter speech-language joint modeling. On multiple speech-language benchmarks, AuRA is reported to outperform cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency, which suggests potential for real-time applications.
AuRA internalizes speech representations into a LoRA-adapted LLM and outperforms cascaded, adaptation-based, and large-scale multimodal baselines on speech-language benchmarks in both effectiveness and efficiency.
Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student's hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.
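To make the layer-wise distillation idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the loss function, the layer correspondence, or how dimensions are matched, so everything here is an assumption for illustration: a mean-squared-error loss, a uniform student-to-teacher layer mapping (the hypothetical helper `map_layers`), and teacher and student hidden states that already share the same shape.

```python
from typing import List

Vector = List[float]
LayerStates = List[Vector]  # hidden states of one layer: [time step][feature]


def map_layers(num_student: int, num_teacher: int) -> List[int]:
    """Assumed uniform mapping from each student layer to a teacher layer index."""
    span = max(num_student - 1, 1)
    return [(i * (num_teacher - 1)) // span for i in range(num_student)]


def mse(a: LayerStates, b: LayerStates) -> float:
    """Mean squared error over all time steps and features of one layer."""
    squared = [(x - y) ** 2 for va, vb in zip(a, b) for x, y in zip(va, vb)]
    return sum(squared) / len(squared)


def layerwise_distillation_loss(
    student: List[LayerStates], teacher: List[LayerStates]
) -> float:
    """Average the per-layer MSE between student layers and their mapped teacher layers.

    Assumption: both models expose hidden states of identical shape. A real
    system would likely need a projection when the dimensions differ.
    """
    mapping = map_layers(len(student), len(teacher))
    per_layer = [mse(student[i], teacher[t]) for i, t in enumerate(mapping)]
    return sum(per_layer) / len(per_layer)


# Toy usage: a 5-layer teacher and a 3-layer student, one time step, two features.
teacher_states = [[[float(layer), float(layer)]] for layer in range(5)]
student_states = [[[0.0, 0.0]], [[2.0, 2.0]], [[4.0, 4.0]]]
loss = layerwise_distillation_loss(student_states, teacher_states)
```

In this toy setup the student layers match teacher layers 0, 2, and 4 exactly, so the loss is zero. In training, a loss of this kind would be minimized with respect to the student's adaptation parameters only (for AuRA, the LoRA weights and the audio embedding layer, per the abstract), with the teacher frozen.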