Jun 8, 2026 · arXiv:2606.09366

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

Ming-Hao Hsu, Yuxuan Hu, Shujie Liu, Jinyu Li, Yan Lu, Zhizheng Wu

AI Summary

This paper introduces Convex Gate (C-Gate), a speech-to-LLM interface that constrains speech representations to the input embedding manifold of a frozen large language model, so that continuous acoustic signals stay compatible with the pretrained model. Each speech frame is represented as a convex combination of token embeddings. On automatic speech recognition and emotion recognition, C-Gate improves LibriSpeech Word Error Rate by up to 48.7% relative while matching or exceeding single-task emotion accuracy. The analysis indicates that information is carried by the geometry of time-resolved embedding trajectories rather than by discrete token identities.
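As a reading aid, a relative WER reduction is conventionally measured against the baseline error rate rather than in absolute points. The notation below is illustrative and not taken from the paper:

```latex
\[
\Delta_{\mathrm{rel}}
  = \frac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{new}}}
         {\mathrm{WER}_{\mathrm{baseline}}}
\]
```

Under this convention, a relative reduction of 48.7% means the error rate fell to roughly half of the baseline value, whatever the absolute numbers are.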

Key Contribution

Geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces.

Abstract

Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which benefits transcription but loses paralinguistic information, or learning unconstrained continuous representations, which can drift away from the LLM's input space and degrade autoregressive decoding. In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM's input embedding manifold with an architectural convex-hull constraint. Concretely, each frame is represented as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM while preserving continuous expressivity. Across automatic speech recognition (ASR) and emotion recognition, C-Gate achieves strong joint performance, improving LibriSpeech WER by up to 48.7% relative while matching or exceeding single-task emotion accuracy. Beyond performance, our analysis reveals a key insight: information is not carried by discrete token identities, but by time-resolved trajectories in the embedding space. Causal interventions confirm that both the trajectory structure and alignment to the pretrained embedding manifold are critical for performance. These results suggest that geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces, and provide a controlled regime for studying multimodal integration in frozen LLMs. We release the checkpoint, per-sample outputs, mechanism dumps, and intervention suite for replication.
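The convex-combination idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the combination weights are produced, so the softmax over a linear projection, the function names (`softmax`, `convex_gate`), and all shapes here are assumptions chosen only to show why the output stays inside the convex hull of the token embeddings.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax: non-negative outputs that sum to 1."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def convex_gate(speech_frames, proj, token_embeddings):
    """Map each speech frame to a convex combination of token embeddings.

    speech_frames:    (T, H) frame features from a speech encoder (assumed)
    proj:             (H, V) hypothetical learned projection to vocabulary logits
    token_embeddings: (V, D) frozen LLM input embedding table
    Returns the (T, D) gated embeddings and the (T, V) mixing weights.
    """
    logits = speech_frames @ proj        # (T, V)
    weights = softmax(logits, axis=-1)   # each row lies on the probability simplex
    gated = weights @ token_embeddings   # (T, D), inside the convex hull of the rows
    return gated, weights

# Toy dimensions; real models use far larger V and D.
rng = np.random.default_rng(0)
T, H, V, D = 5, 16, 32, 8
frames = rng.normal(size=(T, H))
proj = rng.normal(size=(H, V))
E = rng.normal(size=(V, D))

out, w = convex_gate(frames, proj, E)
```

Because every row of `w` is non-negative and sums to 1, each output frame is bounded coordinate-wise by the minimum and maximum of the embedding table, while still varying continuously with the input rather than snapping to a single token.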

Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
