USTC | Apr 15, 2026 | arXiv:2604.13715

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

Yanfeng Shi, Pengfei Cai, Qing Gu, Nan Jiang, Li-Rong Dai, Ian McLoughlin

AI Summary

This paper introduces TimePro-RL, a framework that improves temporal perception in Large Audio-Language Models (LALMs) by encoding timestamps as embeddings and interleaving them within the audio feature sequence. The authors apply Reinforcement Learning (RL) after Supervised Fine-Tuning (SFT) to directly optimize temporal alignment. Experiments show that TimePro-RL significantly improves performance on audio grounding, sound event detection, and dense audio captioning, three tasks that depend on fine-grained temporal understanding.

Key Contribution

Interleaving learned time embeddings into the audio feature sequence, and then fine-tuning the model with RL, gives LALMs a more precise sense of time.
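The interleaving idea can be illustrated with a minimal sketch. This page does not specify how the timestamps are encoded or how often they are inserted, so everything here is an assumption made for illustration: the sinusoidal `time_embedding` stands in for the paper's learned embeddings, and `frame_hop_s`, `every_n`, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def time_embedding(t_seconds: float, dim: int) -> Vector:
    """Sinusoidal encoding of a timestamp.

    Assumption: a fixed sinusoidal code is used here only as a stand-in;
    the paper describes learned embeddings whose form is not given here.
    """
    emb = []
    for i in range(dim):
        # Each sin/cos pair shares one frequency.
        freq = 1.0 / (10000 ** ((i // 2 * 2) / dim))
        angle = t_seconds * freq
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def interleave_time_prompts(
    frames: List[Vector], frame_hop_s: float, every_n: int
) -> List[Vector]:
    """Insert one time embedding before every `every_n` audio frames.

    `frame_hop_s` (seconds per frame) and `every_n` are hypothetical
    parameters chosen for this sketch.
    """
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    dim = len(frames[0]) if frames else 0
    out: List[Vector] = []
    for idx, frame in enumerate(frames):
        if idx % every_n == 0:
            # The inserted vector marks the start time of the frames that follow.
            out.append(time_embedding(idx * frame_hop_s, dim))
        out.append(frame)
    return out
```

With six 4-dimensional frames and `every_n=3`, the output holds eight vectors: a time marker at positions 0 and 4, each followed by three audio frames. The sequence fed to the language model therefore carries explicit temporal coordinates alongside the audio features.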

Abstract

Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
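To make "directly optimize temporal alignment" concrete, the sketch below computes a temporal intersection-over-union (IoU) between a predicted and a reference event segment. This is only an assumption about what an alignment reward might look like: the abstract does not state the reward that TimePro-RL uses, and `temporal_iou` and `Segment` are hypothetical names.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (onset_s, offset_s)


def temporal_iou(pred: Segment, ref: Segment) -> float:
    """Temporal IoU between two segments, in [0, 1].

    Assumption: used here as an illustrative alignment score; it is not
    confirmed to be the reward defined in the paper.
    """
    p_on, p_off = pred
    r_on, r_off = ref
    # Malformed or empty segments earn no reward.
    if p_off <= p_on or r_off <= r_on:
        return 0.0
    inter = max(0.0, min(p_off, r_off) - max(p_on, r_on))
    union = (p_off - p_on) + (r_off - r_on) - inter
    return inter / union
```

A score of this kind is not differentiable with respect to generated timestamp text, which is one reason an RL stage after SFT is a natural fit: the model can be rewarded for the quality of the predicted onset and offset rather than for token-level likelihood.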

Multimodal Models | RLHF & Preference Learning | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 32

Year: 2026

Venue: N/A
