B [26] visual backbone. The action head is a conditional Flow Matching network implemented as an 8-layer Diffusion Transformer (DiT [16]) with a hidden dimension of 1024, trained to predict trajectories of horizon T=

Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China
Ditch the clunky pipelines: SongGen generates complete songs from text in a single pass, offering unprecedented control over musical elements and voice cloning.