The paper introduces SPAR-K, a modality-aware early exit framework for interleaved spoken language models (SLMs) that aims to accelerate inference by exiting speech positions at intermediate transformer layers. SPAR-K employs a speech alternating-depth schedule with periodic full-depth "refresh" steps to maintain performance and mitigate distribution shift. Experiments on Step-Audio-2-mini and GLM-4-Voice show that SPAR-K preserves question-answering accuracy (max 0.82% drop) while reducing average speech decoding depth by up to 11% and 5% respectively, with negligible impact on MOS and WER.
Forget confidence scores: a modality-aware early exit strategy for spoken language models slashes decoding costs without sacrificing accuracy or perceptual quality, revealing that speech tokens require specialized handling compared to text.
Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82% while reducing average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
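The alternating-depth schedule can be illustrated with a minimal sketch. The function below is hypothetical (the paper does not give this code, and the specific layer counts and refresh period are illustrative assumptions): it decides how many transformer layers a decoding step runs, giving text positions full depth, exiting speech positions at a fixed intermediate layer, and running a periodic full-depth "refresh" step.

```python
# Hypothetical sketch of a SPAR-K-style alternating-depth schedule.
# All names and numbers (full_depth, exit_layer, refresh_period) are
# illustrative assumptions, not values from the paper.

def decode_depth(modality: str, speech_step: int,
                 full_depth: int = 32, exit_layer: int = 24,
                 refresh_period: int = 8) -> int:
    """Return the number of transformer layers to run for one decoding step.

    modality: "text" or "speech".
    speech_step: running count of speech tokens generated so far.
    """
    if modality == "text":
        return full_depth  # text positions always decode at full depth
    if speech_step % refresh_period == 0:
        return full_depth  # periodic full-depth refresh mitigates drift
    return exit_layer      # early exit at a fixed intermediate layer


# Example: depths for 12 consecutive speech steps with refresh_period=8
depths = [decode_depth("speech", s) for s in range(12)]
print(depths)  # → [32, 24, 24, 24, 24, 24, 24, 24, 32, 24, 24, 24]
```

Note that the schedule is position-based rather than confidence-based, which is the design choice the paper argues for: it adds no auxiliary computation at decode time.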