The paper investigates whether speech LLMs behave like pipelines in which automatic speech recognition (ASR) feeds a large language model (LLM), i.e. ASR$\rightarrow$LLM cascades, on tasks solvable from a transcript. Through matched-backbone testing across four speech LLMs and six tasks, the authors show that Ultravox is statistically indistinguishable from its matched cascade, that literal text emerges in hidden states, and that text representations are causally necessary. Qwen2-Audio, however, diverges, indicating that cascade equivalence is architecture-dependent. The authors also find that speech LLMs can be less robust than ASR$\rightarrow$LLM cascades under noisy conditions.
Most speech LLMs are just expensive ASR pipelines in disguise, and under noisy conditions they're actually *worse* than the ASR$\rightarrow$LLM cascades they imitate.
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
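
The agreement figure quoted in the abstract can be made concrete with a small sketch. It assumes the reported $\kappa$ is Cohen's kappa computed over per-example predictions of a speech LLM and its matched cascade; the abstract does not spell out the exact protocol, and the function name and toy labels below are hypothetical, not the authors' code or data.

```python
from collections import Counter


def cohens_kappa(preds_a, preds_b):
    """Chance-corrected agreement between two aligned prediction lists.

    Assumption: preds_a holds a speech LLM's answers and preds_b the
    matched cascade's answers for the same examples.
    """
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and equal length")
    n = len(preds_a)
    # Observed agreement: fraction of examples with identical answers.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Expected agreement if both systems answered independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(preds_a), Counter(preds_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both systems always emit the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy, made-up labels for illustration only.
speech_llm = ["yes", "yes", "no", "no"]
cascade = ["yes", "yes", "no", "yes"]
kappa = cohens_kappa(speech_llm, cascade)
```

A value near 1 means the two systems give the same answer far more often than chance would predict, which is the sense in which a speech LLM can be called behaviorally close to its cascade; raw accuracy parity alone would not show that, since two systems can be equally accurate while erring on different examples.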