This paper investigates reinforcement learning (RL) for text-to-audio (T2A) generation using diffusion transformer (DiT) architectures. The authors first use a large language model (LLM) to generate detailed audio captions to improve text-audio semantic alignment. They then fine-tune the T2A model using Group Relative Policy Optimization (GRPO) with various reward functions, demonstrating improved synthesis fidelity and prompt adherence.
GRPO-based fine-tuning, guided by LLM-generated captions, significantly boosts text-to-audio synthesis fidelity and prompt adherence.
Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and in achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine-tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO-based fine-tuning yields substantial gains in synthesis fidelity and prompt adherence.
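As a rough illustration of the group-relative step that gives GRPO its name, the sketch below scores a group of audio samples generated for one prompt and normalizes each reward against the group's mean and standard deviation. It is a minimal sketch, not the paper's implementation: the function names, the weights, and the sign convention for combining CLAP, KL, and FAD (higher CLAP treated as better, lower KL and FAD treated as better) are assumptions made for illustration, and the abstract does not say how per-sample rewards are computed or weighted.

```python
from statistics import mean, pstdev


def combine_rewards(clap, kl, fad, w_clap=1.0, w_kl=1.0, w_fad=1.0):
    """Hypothetical weighted reward for one generated sample.

    Assumption: CLAP similarity is a score to maximize, while KL and FAD
    are distances to minimize, so they enter with a negative sign.
    """
    return w_clap * clap - w_kl * kl - w_fad * fad


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are ranked against their siblings rather than against a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three samples for a single prompt (made-up metric values).
group = [
    combine_rewards(clap=0.30, kl=0.20, fad=0.10),
    combine_rewards(clap=0.50, kl=0.10, fad=0.10),
    combine_rewards(clap=0.40, kl=0.30, fad=0.20),
]
advantages = group_relative_advantages(group)
```

Samples with above-average reward get a positive advantage and are reinforced, while below-average samples are pushed down. If every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal.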