This paper introduces Attention-guided Evidence Grounding (AEG), an end-to-end framework for Spoken Question Answering that uses a SpeechLLM's internal cross-modal attention to locate and ground evidence in the latent space, avoiding cascaded ASR systems. To sharpen the diffuse attention of pre-trained models, the authors propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to separate query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue show that AEG reduces hallucinations and outperforms large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
SpeechLLMs can be made significantly faster and more accurate at question answering by explicitly training their attention mechanisms to focus on relevant evidence.
Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
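The abstract does not say how attention is aggregated or what objective LFE optimizes, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a matrix of attention weights from spoken-query positions to context tokens is already available, scores each context segment by its average attention mass, and computes a hypothetical "focus" loss that penalizes attention falling outside labeled evidence segments. The function names, the mean-based aggregation, and the loss form are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def segment_attention_scores(
    attn: Sequence[Sequence[float]],
    segments: Sequence[Tuple[int, int]],
) -> List[float]:
    """Average attention mass each context segment receives.

    attn[i][j] is the (assumed) attention weight from query position i
    to context token j. Each segment is a half-open token span (start, end).
    """
    n_query = len(attn)
    scores = []
    for start, end in segments:
        # Sum the mass inside the span for every query position, then average.
        mass = sum(sum(row[start:end]) for row in attn) / n_query
        scores.append(mass)
    return scores


def top_k_segments(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring segments, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def focus_loss(
    scores: Sequence[float],
    evidence_mask: Sequence[bool],
    eps: float = 1e-9,
) -> float:
    """Hypothetical supervision signal: negative log of the share of
    attention mass that lands on segments labeled as evidence."""
    total = sum(scores)
    on_evidence = sum(s for s, is_ev in zip(scores, evidence_mask) if is_ev)
    return -math.log((on_evidence + eps) / (total + eps))


# Toy example: 2 query positions attending over 4 context tokens,
# split into two segments of 2 tokens each.
attn = [
    [0.10, 0.10, 0.60, 0.20],
    [0.05, 0.15, 0.50, 0.30],
]
segments = [(0, 2), (2, 4)]

scores = segment_attention_scores(attn, segments)
best = top_k_segments(scores, k=1)
loss = focus_loss(scores, evidence_mask=[False, True])
print(scores, best, loss)
```

In this toy setup the second segment receives most of the attention mass, so it is selected as evidence, and the loss shrinks as more mass concentrates on the labeled segment. A real system would read the attention weights from the model's layers and heads, and the choice of which ones to use is left open here.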