Tsinghua AI, HKU, HUST, KU, SCUT, Soochow, WHU | Mar 17, 2026 | arXiv:2603.16292

Attention-guided Evidence Grounding for Spoken Question Answering

Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao

AI Summary

This paper introduces Attention-guided Evidence Grounding (AEG), an end-to-end framework for Spoken Question Answering that uses a SpeechLLM's internal cross-modal attention to locate and ground evidence in the model's latent space, avoiding cascaded ASR systems. To sharpen the diffuse attention of pre-trained models, the authors propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the attention mechanism to separate query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue show that AEG reduces hallucinations and outperforms large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.

Key Contribution

SpeechLLMs can be made significantly faster and more accurate at question answering by explicitly training their attention mechanisms to focus on relevant evidence.

Abstract

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
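The abstract does not spell out how attention is turned into grounded evidence, so the following is only a minimal sketch of one plausible reading: pool the attention that spoken-query positions place on context tokens, score each context segment by the attention mass it receives, and keep the top-scoring segments. The function names, the mean-pooling rule, the segment spans, and the toy attention values are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def segment_scores(attn: List[List[float]],
                   segments: List[Tuple[int, int]]) -> List[float]:
    """Score each context segment by the attention mass it receives.

    attn[q][c] is the (already normalized) attention weight from query
    position q to context token c. Each segment is a half-open token
    span (start, end). Averaging over query positions is an assumption
    made for this sketch.
    """
    n_query = len(attn)
    scores = []
    for start, end in segments:
        mass = sum(sum(row[start:end]) for row in attn)
        scores.append(mass / n_query)
    return scores


def top_k_segments(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring segments, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example (hypothetical values): 2 query positions, 6 context tokens,
# 3 segments of 2 tokens each. Most attention falls on the middle segment.
attn = [
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
segments = [(0, 2), (2, 4), (4, 6)]

scores = segment_scores(attn, segments)
selected = top_k_segments(scores, k=1)
```

In this toy case the middle segment collects most of the attention mass and is the one selected. With diffuse attention the scores would be nearly uniform and the ranking unreliable, which is the problem the abstract says LFE fine-tuning is meant to address.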

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
