NVIDIA, NTU Taiwan | Apr 19, 2026 | arXiv:2604.17248

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

Yi-Cheng Lin, Yusuke Hirota, Hung-yi Lee

AI Summary

The paper introduces VIBE, a framework for evaluating generative bias in Large Audio-Language Models (LALMs) using open-ended tasks with real-world human speech recordings. VIBE moves beyond synthetic speech and multiple-choice questions to assess how stereotypical associations emerge organically in tasks such as personalized recommendations. Experiments on 11 state-of-the-art LALMs reveal systematic biases, with gender cues often eliciting larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.

Key Contribution

When LALMs generate open-ended responses to real human speech, systematic biases surface, and gender cues often shift their outputs more than accent cues do.

Abstract

Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairness. We propose VIBE, a framework that evaluates generative bias through open-ended tasks such as personalized recommendations, using real-world human recordings. Unlike MCQs, our method allows stereotypical associations to manifest organically without predefined options, making it easily extensible to new tasks. Evaluating 11 state-of-the-art LALMs reveals systematic biases in realistic scenarios. We find that gender cues often trigger larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.
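The abstract does not say how a "distributional shift" is measured. As a minimal sketch only, one common way to compare two groups' open-ended outputs is to map each response to a category and compute the Jensen-Shannon divergence between the resulting category distributions. The metric choice, the function names (`distribution`, `js_divergence`), and the example labels below are all assumptions made for illustration; they are not taken from the paper and are not results.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Turn a list of category labels into a probability vector over `categories`."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical, made-up category labels for recommendations a model might
# give to the same request spoken by two speaker groups (NOT paper data).
group_a = ["nursing", "teaching", "nursing", "design"]
group_b = ["engineering", "engineering", "teaching", "finance"]

categories = sorted(set(group_a) | set(group_b))
shift = js_divergence(
    distribution(group_a, categories),
    distribution(group_b, categories),
)
print(f"JS divergence between groups: {shift:.3f} bits")
```

Under this assumed setup, a larger value for a gender pairing than for an accent pairing would correspond to the kind of comparison the abstract describes. Turning free-form responses into category labels is a separate step that this sketch leaves out.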

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
