NVIDIA · AV (audio in / cascaded avatar out) · May 28, 2026 · arXiv:2605.30256

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Amrita Mazumdar, Seonwook Park, Rajarshi Roy, N. Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

AI Summary

The paper introduces VideoFDB, a new benchmark for evaluating full-duplex audio-visual conversational agents, addressing a gap in existing full-duplex benchmarks, which evaluate only speech. VideoFDB comprises 237 dyadic video clips spanning 11 nonverbal conversational dynamics and a rubric-based evaluation framework using LMs as judges. Experiments using VideoFDB reveal that current vision-speech agents exhibit captioning collapse and visual-stream ignorance, and fail to leverage vision for streaming joint audiovisual grounding, highlighting the need for architectural improvements to support full-duplex nonverbal cues.

Key Contribution

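VideoFDB is the first benchmark for full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, and it shows that current vision-speech agents fail to use the real-time nonverbal cues, such as nods, smiles, and gestures, that make human conversation feel natural.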

Abstract

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 56

Year: 2026

Venue: N/A
