This paper investigates the effectiveness of Contrastive Decoding (CD) in enhancing Large Audio Language Models (LALMs) by evaluating four different CD strategies across various LALM architectures. The study finds that Audio-Aware Decoding and Audio Contrastive Decoding are the most effective, but their performance is model-dependent. Using a Transition Matrix framework, the authors show that CD corrects errors related to audio absence and uncertainty, but struggles with flawed reasoning and confident misassertions, providing guidelines for matching LALM architectures with appropriate CD strategies.
Contrastive Decoding's boost to audio language models hinges on fixing specific error types, such as uncertainty-driven guessing and false claims of missing audio, but don't expect it to magically repair flawed reasoning.
While Contrastive Decoding (CD) has proven effective at enhancing Large Audio Language Models (LALMs), the underlying mechanisms driving its success and the comparative efficacy of different strategies remain unclear. This study systematically evaluates four distinct CD strategies across diverse LALM architectures. We identify Audio-Aware Decoding and Audio Contrastive Decoding as the most effective methods. However, their impact varies significantly by model. To explain this variability, we introduce a Transition Matrix framework to map error pattern shifts during inference. Our analysis demonstrates that CD reliably rectifies errors in which models falsely claim an absence of audio or resort to uncertainty-driven guessing. Conversely, it fails to correct flawed reasoning or confident misassertions. Ultimately, these findings provide a clear guideline for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles.
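The Transition Matrix framework can be pictured as a simple cross-tabulation: each test item is labeled with its baseline error category and its category after CD, and the matrix counts how often each category transitions into each other. The sketch below is an illustrative reconstruction, not the authors' code; the category names are hypothetical stand-ins for the error taxonomy described in the abstract.

```python
from collections import Counter

# Hypothetical category labels for illustration, loosely following the
# error types named in the abstract (plus a "correct" outcome).
CATEGORIES = ["correct", "audio_absence", "uncertainty",
              "flawed_reasoning", "confident_misassertion"]

def transition_matrix(baseline_labels, cd_labels):
    """Count how each baseline category transitions under CD.

    matrix[i][j] = number of items labeled CATEGORIES[i] at baseline
    and CATEGORIES[j] after applying contrastive decoding.
    """
    assert len(baseline_labels) == len(cd_labels)
    index = {c: i for i, c in enumerate(CATEGORIES)}
    matrix = [[0] * len(CATEGORIES) for _ in CATEGORIES]
    for (src, dst), n in Counter(zip(baseline_labels, cd_labels)).items():
        matrix[index[src]][index[dst]] = n
    return matrix

# Toy example: CD fixes the audio-absence error but not the reasoning error,
# mirroring the paper's qualitative finding.
base = ["audio_absence", "flawed_reasoning", "correct"]
cd   = ["correct",       "flawed_reasoning", "correct"]
m = transition_matrix(base, cd)
```

Reading row `audio_absence` of `m`, the mass moves into the `correct` column, while the `flawed_reasoning` row stays on its diagonal, which is exactly the kind of shift the framework is designed to surface.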