The paper introduces Timestamped Audio Captioner (TAC), a model designed to generate temporally grounded audio descriptions for complex acoustic scenes, addressing the limitations of Large Audio Language Models in disentangling overlapping events. TAC is trained using a synthetic data pipeline that creates challenging mixtures from real-world audio, enhancing robustness in polyphonic conditions. The model achieves state-of-the-art performance in event detection and dense captioning, demonstrating low hallucination rates and accurate temporal grounding, and further serves as a semantic bridge to improve performance on audio and audio-visual reasoning benchmarks when cascaded with LLMs.
A new model, TAC, uses synthetic training data to achieve state-of-the-art audio and audio-visual reasoning by generating temporally grounded captions that can then be fed into LLMs.
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline that generates semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning, respectively.
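To make the cascade idea concrete, the following is a minimal sketch of how timestamped captions might be serialized into a prompt for a text-only reasoner. It is an illustration under stated assumptions, not the authors' implementation: the `TimedCaption` type, the `build_prompt` function, and the prompt layout are all hypothetical, and the abstract does not specify TAC's output format or the prompt that is actually used.

```python
from dataclasses import dataclass


@dataclass
class TimedCaption:
    """Hypothetical container for one temporally grounded caption."""
    start: float  # event onset in seconds
    end: float    # event offset in seconds
    text: str     # natural-language description of the event


def build_prompt(captions, question):
    """Serialize captions in chronological order and append a question.

    The layout below is an assumption made for illustration; overlapping
    events simply appear as separate lines with overlapping time ranges.
    """
    ordered = sorted(captions, key=lambda c: (c.start, c.end))
    lines = [f"[{c.start:.1f}s - {c.end:.1f}s] {c.text}" for c in ordered]
    return (
        "Audio events:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )


# Example input with two overlapping events (made-up data).
captions = [
    TimedCaption(2.0, 5.5, "a dog barks"),
    TimedCaption(0.0, 3.0, "rain falls"),
]
prompt = build_prompt(captions, "What happens first?")
```

The resulting string would then be passed to whichever text-only LLM acts as the reasoner. Keeping the timestamps in plain text is what lets a model with no audio input answer questions about event order and overlap.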