Feb 19, 2026 · arXiv:2602.17598

The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

AI Summary

The paper investigates whether speech LLMs behave like ASR$\rightarrow$LLM cascades, that is, automatic speech recognition (ASR) followed by a large language model (LLM), on tasks solvable from a transcript. Through matched-backbone testing across four speech LLMs and six tasks, the authors show that Ultravox is statistically indistinguishable from its matched cascade, with literal text emerging in hidden states and text representations proving causally necessary. Qwen2-Audio, however, diverges, indicating that cascade equivalence is architecture-dependent. The authors also find that speech LLMs can be less robust than ASR$\rightarrow$LLM cascades under noisy conditions.
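Noise robustness of this kind is usually measured by mixing noise into clean speech at a fixed signal-to-noise ratio (SNR); the abstract reports results at 0 dB, where noise and speech have equal power. The sketch below shows one common way to scale noise to a target SNR. It is an illustration only: the noise type and mixing procedure used in the paper are not stated here, and the function names `scale_noise_to_snr` and `measured_snr_db`, the synthetic sine "speech", and the Gaussian noise are all assumptions.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared samples) of a waveform."""
    return sum(x * x for x in samples) / len(samples)


def scale_noise_to_snr(speech, noise, snr_db):
    """Return a copy of `noise` scaled so speech-to-noise power equals snr_db."""
    if len(speech) != len(noise) or not speech:
        raise ValueError("speech and noise must be non-empty and equally long")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [gain * n for n in noise]


def measured_snr_db(speech, noise):
    """SNR in dB between a clean signal and the noise added to it."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


# Synthetic stand-ins (hypothetical, not data from the paper).
rng = random.Random(0)
sample_rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

scaled_noise = scale_noise_to_snr(speech, noise, snr_db=0.0)
noisy_speech = [s + n for s, n in zip(speech, scaled_noise)]
snr_check = measured_snr_db(speech, scaled_noise)  # should be very close to 0 dB
```

The same noisy waveform would then be fed to both the speech LLM and the cascade so that any accuracy gap reflects the systems rather than the input.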

Key Contribution

On tasks solvable from a transcript, current speech LLMs are largely expensive ASR$\rightarrow$LLM cascades in disguise, and under noisy conditions they are actually *worse* than those cascades.

Abstract

Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\rightarrow$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
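To make the agreement statistic concrete, the sketch below computes Cohen's kappa between the per-item answers of two systems. This assumes that $\kappa$ in the abstract denotes Cohen's kappa, which the abstract does not state explicitly; the function name `cohen_kappa` and the toy label lists are hypothetical and are not results from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both systems give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each system's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both systems always emit the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical multiple-choice answers on eight items (not data from the paper).
speech_llm = ["A", "B", "A", "C", "B", "A", "C", "B"]
cascade = ["A", "B", "A", "C", "B", "B", "C", "B"]

kappa = cohen_kappa(speech_llm, cascade)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 1 means the two systems agree on individual items far more often than their label frequencies alone would predict.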

Interpretability & Mechanistic Interp · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
