This study investigates how speech token design affects the reasoning capabilities of spoken dialogue models built on text-based LLMs. To address the temporal-granularity mismatch between speech and text, the authors introduce a factorized FSQ and a lightweight non-autoregressive audio LM head, which make low frame rates feasible without sacrificing efficient prediction. The key finding is that a frame rate of 4.17 Hz combined with intermediate-layer representation alignment gives the best speech question-answering performance in the sweep.
Speech QA models achieve peak performance at 4.17 Hz, revealing the critical role of frame rate and representation alignment in bridging the gap between speech and text reasoning.
Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
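The trade-off behind a fixed information rate can be illustrated with a little arithmetic: if the bits per second are held constant, every reduction in frame rate must be matched by a proportional increase in bits per frame. The sketch below is only an illustration under stated assumptions. The 500 bits/s budget, the list of downsampling factors, and the FSQ level configuration are hypothetical values chosen for the example, not figures from the paper. The only facts relied on are that 50/12 ≈ 4.17 Hz and 50/24 ≈ 2.08 Hz, and that a standard FSQ code with per-dimension level counts L_i carries sum(log2(L_i)) bits.

```python
import math

def bits_per_frame(info_rate_bps: float, frame_rate_hz: float) -> float:
    """Bits each frame must carry to keep the information rate fixed."""
    return info_rate_bps / frame_rate_hz

def fsq_bits(levels: list[int]) -> float:
    """Capacity in bits of one FSQ code with the given per-dimension levels."""
    return sum(math.log2(n) for n in levels)

BASE_HZ = 50.0
# Hypothetical downsampling factors; 12 and 24 reproduce 4.17 Hz and 2.08 Hz.
FACTORS = [1, 2, 4, 8, 12, 24]
# Hypothetical information-rate budget, for illustration only.
INFO_RATE_BPS = 500.0

for factor in FACTORS:
    hz = BASE_HZ / factor
    print(f"{hz:6.2f} Hz -> {bits_per_frame(INFO_RATE_BPS, hz):6.1f} bits/frame")

# Hypothetical FSQ configuration: four dimensions with 8 levels each.
print(f"FSQ [8, 8, 8, 8] -> {fsq_bits([8, 8, 8, 8]):.1f} bits")
```

Under these assumed numbers, the per-frame payload grows 24-fold between the highest and lowest frame rates, which is why a single small codebook stops being sufficient at low frame rates and a higher-capacity quantizer becomes necessary.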