This paper investigates the modality gap between speech and text in end-to-end speech LLMs by analyzing how speech and text representations evolve across layers, using cross-layer Centered Kernel Alignment (CKA). The study finds a broad cross-layer alignment band for speech representations, attributed to redundancy in the acoustic signal, and shows that simple statistical calibration is insufficient and can even be detrimental when applied at the input layer, which suggests the gap is more than a distribution shift. The findings locate the bottleneck in condensing redundant speech information into stable late-layer decisions and point towards token- or temporal-level solutions.
The difficulty for speech LLMs is not merely an input distribution shift; it lies in condensing redundant acoustic information into stable, high-level semantic representations.
Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
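To make the cross-layer CKA analysis concrete, the following is a minimal sketch, not the authors' implementation. It assumes the linear variant of CKA (the abstract does not say which kernel is used) and assumes that speech frames are mean-pooled onto text tokens using precomputed alignment spans. The helper names `pool_frames`, `linear_cka`, and `cross_layer_cka` are hypothetical.

```python
import numpy as np


def pool_frames(frames: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool speech frames onto text tokens.

    frames: (n_frames, dim) hidden states for one utterance at one layer.
    spans:  one (start, end) frame range per text token, end exclusive.
            Assumed to come from an external speech-text aligner.
    """
    return np.stack([frames[start:end].mean(axis=0) for start, end in spans])


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_tokens, dim) matrices with matching rows."""
    # Center each feature so the similarity ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cross_layer_cka(speech_layers: list[np.ndarray],
                    text_layers: list[np.ndarray]) -> np.ndarray:
    """Return a (n_speech_layers, n_text_layers) matrix of CKA scores.

    Each list entry is an (n_tokens, dim) matrix for one layer, with
    speech already pooled so that row i corresponds to text token i.
    """
    scores = np.zeros((len(speech_layers), len(text_layers)))
    for i, s in enumerate(speech_layers):
        for j, t in enumerate(text_layers):
            scores[i, j] = linear_cka(s, t)
    return scores
```

In a matrix like this, a narrow ridge along the diagonal would mean that each speech layer matches one text layer, while a broad band would mean that a speech layer resembles many text layers at once, which is the pattern the abstract describes. Mean pooling is only one possible choice of token alignment.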