CUHK, HKUST, Hong Kong Baptist University, Independent Researcher, PolyU, University of Surrey
Jun 10, 2026
arXiv:2606.12199

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Zhen Ye, Xu Tan, Yiming Li, Guangyan Zhang, Chimin Chan, Haohe Liu, Zhengxi Liu, Hongzhan Lin, Zheqi Dai, Xinshen Zhang, Peiwen Sun, Qiuqiang Kong, Wei Xue

AI Summary

This study investigates how speech token design affects the reasoning ability of spoken dialogue models built on text LLM backbones. To address the temporal-granularity mismatch between speech and text, the authors introduce a factorized FSQ and a lightweight non-autoregressive audio LM head, which make low frame rates feasible. The key finding is that a frame rate of 4.17 Hz combined with intermediate-layer representation alignment gives the best speech question-answering performance.

Key Contribution

Speech QA models achieve peak performance at 4.17 Hz, revealing the critical role of frame rate and representation alignment in bridging the gap between speech and text reasoning.

Abstract

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
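The trade-off behind a frame-rate sweep at a fixed information rate can be illustrated with simple arithmetic: lowering the frame rate shortens the token sequence but requires each frame to carry proportionally more bits. The sketch below is only an illustration under stated assumptions. The information rate of 600 bits/s is a hypothetical value chosen for the example, not a figure reported in the abstract, and the function names are invented here. The three frame rates are the ones the abstract mentions, written as 50 Hz divided by 1, 12, and 24.

```python
import math

# ASSUMPTION: illustrative fixed information rate in bits per second.
# The abstract states that the information rate is held fixed but does
# not give its value; 600 is a placeholder for this example only.
ASSUMED_BITS_PER_SECOND = 600.0

BASE_FRAME_RATE_HZ = 50.0

# Downsampling factors relative to 50 Hz. 50/12 is about 4.17 Hz and
# 50/24 is about 2.08 Hz, matching the rates named in the abstract.
DOWNSAMPLE_FACTORS = [1, 12, 24]


def bits_per_frame(frame_rate_hz, bits_per_second=ASSUMED_BITS_PER_SECOND):
    """Bits each frame must carry to keep the information rate constant."""
    return bits_per_second / frame_rate_hz


def frames_for(duration_s, frame_rate_hz):
    """Number of speech frames (tokens) needed to cover an utterance."""
    return math.ceil(duration_s * frame_rate_hz)


def sweep(duration_s=10.0):
    """Return (frame rate in Hz, frame count, bits per frame) for each factor."""
    rows = []
    for factor in DOWNSAMPLE_FACTORS:
        rate = BASE_FRAME_RATE_HZ / factor
        rows.append(
            (round(rate, 2), frames_for(duration_s, rate), round(bits_per_frame(rate), 1))
        )
    return rows


if __name__ == "__main__":
    for rate, n_frames, bits in sweep():
        print(f"{rate:6.2f} Hz  {n_frames:4d} frames per 10 s  {bits:6.1f} bits/frame")
```

Under this assumed rate, a 10-second utterance shrinks from 500 frames at 50 Hz to 21 frames at 2.08 Hz, while the per-frame budget grows by the same factor of 24. This is the kind of per-frame capacity requirement that makes a higher-capacity quantizer necessary at low frame rates.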

Multimodal Models, Reasoning & Chain-of-Thought, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
