Qi Zhang

$\times 10^{-5}$, and training continues for an additional 3,000 steps with a global batch size of 256.

Infrastructure. We adopt a hybrid parallel optimization strategy during training. We enable tensor parallelism on the VLM side. For the diffusion model, we use parameter sharding (ZeRO Stage-2) together with bfloat16 (BF16) mixed-precision training. To keep sequence lengths uniform within a mini-batch, we maintain two independent bucketers: one by image aspect ratio (supporting 1:1, 1:2, 2:3, 3:4, 3:5, 4:5, and 9:16) and one by the number of reference images. Samples in the same batch therefore produce the same number of latent tokens, which reduces padding and improves throughput.

Table 3: Data outline and training details for each training stage. Q. denotes the Query-kontext tokens; Con. denotes the Connector module.

Stage Stage 1 Stage 2 Stage 3
Task Image Generation Image Generation Instruction Editing Image Reconstruction Image Reconstruction Customized Generation Image Transformation Multi-subject
Type T
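The bucketing idea in the Infrastructure paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the sample fields (`width`, `height`, `num_refs`), the helper names, and the nearest-ratio assignment are all hypothetical, and the two independent bucketers are collapsed here into a single combined key for brevity. Reading the listed ratios as width:height is likewise an assumption.

```python
from collections import defaultdict

# Aspect ratios listed in the text; treating them as (width, height) is an assumption.
SUPPORTED_RATIOS = [(1, 1), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (9, 16)]


def nearest_ratio(width, height):
    """Return the supported ratio closest to width / height (hypothetical rule)."""
    target = width / height
    return min(SUPPORTED_RATIOS, key=lambda r: abs(r[0] / r[1] - target))


def bucket_samples(samples):
    """Group samples by (aspect-ratio bucket, number of reference images)."""
    buckets = defaultdict(list)
    for sample in samples:
        key = (nearest_ratio(sample["width"], sample["height"]), sample["num_refs"])
        buckets[key].append(sample)
    return buckets


def make_batches(samples, batch_size):
    """Form mini-batches that never mix samples from different buckets."""
    batches = []
    for group in bucket_samples(samples).values():
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


if __name__ == "__main__":
    demo = [
        {"width": 512, "height": 512, "num_refs": 1},
        {"width": 512, "height": 1024, "num_refs": 1},
        {"width": 512, "height": 512, "num_refs": 2},
        {"width": 512, "height": 512, "num_refs": 1},
    ]
    for batch in make_batches(demo, batch_size=2):
        print(len(batch), nearest_ratio(batch[0]["width"], batch[0]["height"]), batch[0]["num_refs"])
```

Because every batch is drawn from a single bucket, all of its samples share an aspect ratio and a reference-image count, so they would map to the same number of latent tokens and need no padding against each other.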

Research focus

Data Curation & Synthetic Data (1), Open-Source Models & Weights (1), Speech & Audio (1)

Frequent co-authors

Changhao Jiang (1), Jiahao Chen (1), Zhenghao Xiang (1), Zhixiong Yang (1)

Papers (1)

Google Research · Jan 7, 2026 · also Fudan, HuggingFace

Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control

Finally, a fully open-source, reproducible system for long-form song generation is here, complete with licensed data, code, and a Qwen-based model that rivals closed-source systems.

Changhao Jiang, Jiahao Chen, Zhenghao Xiang +14

Data Curation & Synthetic Data, Open-Source Models & Weights, Speech & Audio
