This paper introduces TimePro-RL, a framework that improves temporal perception in Large Audio-Language Models (LALMs) by encoding timestamps as embeddings and interleaving them with the audio features. The authors then apply Reinforcement Learning (RL) after Supervised Fine-Tuning (SFT) to directly optimize temporal alignment. Experiments show that TimePro-RL yields significant gains on audio grounding, sound event detection, and dense audio captioning, supporting its effectiveness for fine-grained temporal understanding.
LALMs can gain a far more precise sense of time when learned time embeddings are interleaved into their audio feature sequences and the model is then fine-tuned with RL.
Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
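The interleaving idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `time_embedding` and `interleave_time_prompts`, the sinusoidal encoding (a stand-in for whatever timestamp encoder the paper actually uses), the frame rate, and the one-second spacing are all assumptions chosen only to show how time markers might sit between blocks of audio frames.

```python
import math
from typing import List

Vector = List[float]


def time_embedding(t_seconds: float, dim: int) -> Vector:
    """Hypothetical sinusoidal embedding of a timestamp.

    Assumption: a stand-in for the paper's timestamp encoder, which the
    abstract does not specify.
    """
    vec: Vector = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        vec.append(math.sin(t_seconds * freq))
        vec.append(math.cos(t_seconds * freq))
    if dim % 2:  # pad odd dimensions so the length matches the audio frames
        vec.append(0.0)
    return vec


def interleave_time_prompts(
    frames: List[Vector], frame_rate_hz: float, interval_s: float
) -> List[Vector]:
    """Insert one time embedding before each block of audio frames.

    Each block covers roughly `interval_s` seconds, and the inserted
    embedding encodes the start time of that block.
    """
    if not frames:
        return []
    dim = len(frames[0])
    frames_per_block = max(1, round(frame_rate_hz * interval_s))
    out: List[Vector] = []
    for start in range(0, len(frames), frames_per_block):
        out.append(time_embedding(start / frame_rate_hz, dim))
        out.extend(frames[start:start + frames_per_block])
    return out


# Toy input: 10 audio frames of dimension 8 at an assumed 5 frames per second.
frames = [[float(i)] * 8 for i in range(10)]
mixed = interleave_time_prompts(frames, frame_rate_hz=5.0, interval_s=1.0)
print(len(mixed))  # 10 audio frames plus 2 time markers
```

In this toy layout the sequence reads: marker for 0 s, five audio frames, marker for 1 s, five audio frames. A language model consuming the sequence would then see an explicit temporal coordinate next to the audio content it describes, which is the intuition behind the audio-side time prompt.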