Google Research, Adobe Research, ByteDance, UMD | Feb 17, 2026 | arXiv:2602.15766

TAC: Timestamped Audio Captioning

Sonal Kumar, Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Nicholas J. Bryan, Zeyu Jin, Justin Salamon

AI Summary

The paper introduces Timestamped Audio Captioner (TAC), a model designed to generate temporally grounded audio descriptions for complex acoustic scenes, addressing the limitations of Large Audio Language Models in disentangling overlapping events. TAC is trained using a synthetic data pipeline that creates challenging mixtures from real-world audio, enhancing robustness in polyphonic conditions. The model achieves state-of-the-art performance in event detection and dense captioning, demonstrating low hallucination rates and accurate temporal grounding, and further serves as a semantic bridge to improve performance on audio and audio-visual reasoning benchmarks when cascaded with LLMs.

Key Contribution

TAC, a model trained on synthetic audio mixtures, generates temporally grounded captions. Feeding these captions (and TAC-V's audio-visual descriptions) to a text-only LLM yields state-of-the-art results on audio and audio-visual reasoning benchmarks.

Abstract

Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline that generates semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning, respectively.
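As a rough illustration of the ideas in the abstract (not the authors' pipeline), the sketch below places labeled clips on a timeline with overlaps allowed, renders them as a timestamped caption, and wraps that caption in a prompt for a text-only LLM. The `Event` class, the function names, the caption format, and the prompt wording are all hypothetical; no audio is mixed and no model is called.

```python
from dataclasses import dataclass
import random


@dataclass
class Event:
    label: str    # hypothetical text description of a source clip
    start: float  # onset in seconds within the mixture
    end: float    # offset in seconds within the mixture


def place_events(clips, mixture_len, seed=0):
    """Place (label, duration) clips at random onsets; overlaps are allowed."""
    rng = random.Random(seed)
    events = []
    for label, duration in clips:
        duration = min(duration, mixture_len)  # clip must fit in the mixture
        start = rng.uniform(0.0, mixture_len - duration)
        events.append(Event(label, round(start, 2), round(start + duration, 2)))
    return sorted(events, key=lambda e: e.start)


def to_timestamped_caption(events):
    """Render events as one '[start - end] label' line each (assumed format)."""
    return "\n".join(f"[{e.start:.2f}s - {e.end:.2f}s] {e.label}" for e in events)


def build_prompt(caption, question):
    """Build the text a downstream text-only LLM would receive in a cascade."""
    return (
        "The following timestamped caption describes an audio clip.\n"
        f"{caption}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    clips = [("dog barking", 2.0), ("steady rain", 8.0), ("car horn", 1.0)]
    events = place_events(clips, mixture_len=10.0, seed=1)
    print(build_prompt(to_timestamped_caption(events), "Which sound starts first?"))
```

In a real cascade, the caption would come from the captioning model rather than from known ground-truth placements, and the prompt would be sent to whichever LLM is used as the reasoner.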

Topics: Data Curation & Synthetic Data, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
