This study introduces Mel-LLM, an encoder-free Speech-LLM that processes Mel spectrogram patches directly, without a dedicated speech encoder, so that the model learns speech-text alignment solely through its own parameters. Extensive experiments on automatic speech recognition (ASR) and text-to-speech (TTS) tasks show that Mel-LLM achieves competitive ASR performance with only limited degradation compared to encoder-initialized models. When data is scarce, initialization from a multimodal checkpoint is crucial for maintaining performance. Preliminary TTS results, while not yet optimal, indicate that a unified encoder-free architecture for autoregressive speech-text modeling is feasible.
Encoder-free speech modeling can come close to encoder-initialized models on ASR, challenging the necessity of dedicated speech encoders in LLM architectures.
Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantically rich representations that the LLM can consume. In this work, we instead ask: can an LLM learn to read Mel spectrograms directly, without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the public sets of the OpenASR leaderboard and in production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.
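To make the input path concrete, the following is a minimal sketch of how Mel spectrogram patches might be formed and mapped into an LLM's embedding space by a single linear projection. It is an illustration only, not the authors' implementation: the abstract does not specify the patch size, the number of Mel bins, the embedding width, or the pre-processing, so `N_MELS`, `PATCH_FRAMES`, `D_MODEL`, and the helper names `patchify` and `linear_project` are all assumptions, and random numbers stand in for a real spectrogram and for learned weights.

```python
import random


def patchify(mel, patch_frames):
    """Group consecutive Mel frames into flat patches.

    mel: list of T frames, each a list of n_mels floats.
    Returns T // patch_frames patches, each of length patch_frames * n_mels.
    Trailing frames that do not fill a whole patch are dropped.
    """
    n_patches = len(mel) // patch_frames
    patches = []
    for p in range(n_patches):
        frames = mel[p * patch_frames:(p + 1) * patch_frames]
        patches.append([value for frame in frames for value in frame])
    return patches


def linear_project(patches, weight, bias):
    """Apply y = W x + b to every patch; weight has shape (d_model, patch_dim)."""
    return [
        [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
N_MELS, PATCH_FRAMES, D_MODEL, T = 80, 4, 16, 50

rng = random.Random(0)
# Stand-in for a real log-Mel spectrogram with T frames of N_MELS bins.
mel = [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in range(T)]
# Stand-in for the learned projection weights.
weight = [
    [rng.gauss(0.0, 0.02) for _ in range(N_MELS * PATCH_FRAMES)]
    for _ in range(D_MODEL)
]
bias = [0.0] * D_MODEL

patches = patchify(mel, PATCH_FRAMES)
speech_embeddings = linear_project(patches, weight, bias)
print(len(speech_embeddings), len(speech_embeddings[0]))  # 12 16
```

In a setup like this, the projected patches would take the place of speech-encoder outputs in the LLM's input sequence, alongside the text token embeddings. The projection is the only speech-specific component, so any speech-text alignment has to be learned by the LLM layers themselves.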