Mar 2, 2026 · arXiv:2603.01565

Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation

Yi Gu, Yi Gu, Yanqing Liu, Chen Yang, Shengchao Zhao, Sheng Zhao

AI Summary

This paper investigates reinforcement learning (RL) for text-to-audio (T2A) generation using diffusion transformer (DiT) architectures. The authors first use a large language model (LLM) to generate detailed audio captions that improve text-audio semantic alignment. They then fine-tune the T2A model using Group Relative Policy Optimization (GRPO) with various reward functions and report improved synthesis fidelity and prompt adherence.

Key Contribution

GRPO-based fine-tuning, guided by LLM-generated captions, significantly boosts text-to-audio synthesis fidelity and prompt adherence.

Abstract

Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine-tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO-based fine-tuning yields substantial gains in synthesis fidelity and prompt adherence.
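
To make the group-relative idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly described GRPO advantage, in which each sample's reward is normalized by the mean and standard deviation of the rewards in its own group (all samples generated for the same prompt). The function names, the weighted reward combination, the weights, and the numeric scores are all hypothetical; the abstract states only that CLAP, KL, FAD, and their combinations were tried as rewards.

```python
from statistics import mean, pstdev


def combined_reward(clap_score, kl_penalty, fad_penalty,
                    w_clap=1.0, w_kl=0.1, w_fad=0.1):
    """Hypothetical weighted reward (assumption, not from the paper).

    Higher CLAP similarity is treated as better; KL and FAD are treated
    as penalties because lower values are better for both.
    """
    return w_clap * clap_score - w_kl * kl_penalty - w_fad * fad_penalty


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    Uses the population standard deviation (an assumption); eps avoids
    division by zero when every sample in the group has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical audio samples generated for a single text prompt,
# each scored as (CLAP similarity, KL, FAD). The numbers are made up.
group = [
    combined_reward(0.42, 1.5, 2.0),
    combined_reward(0.55, 1.2, 1.8),
    combined_reward(0.30, 2.0, 2.5),
    combined_reward(0.48, 1.4, 1.9),
]

advantages = group_relative_advantages(group)
```

Samples that score above the group mean receive a positive advantage and are reinforced; those below it receive a negative one. Because the baseline is the group itself, no separate value network is needed, which is the usual motivation given for GRPO.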

Multimodal Models · RLHF & Preference Learning · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
