$1\times 10^{-5}$, and training continues for an additional 3,000 steps with a global batch size of 256.

Infrastructure. We adopt a hybrid parallel optimization strategy during training. On the VLM side, we enable tensor parallelism. For the diffusion model, we use parameter sharding (ZeRO Stage-2) together with bfloat16 (BF16) mixed-precision training. To keep sequence lengths uniform within a mini-batch, we maintain two independent bucketeers, one keyed by image aspect ratio (supporting 1:1, 1:2, 2:3, 3:4, 3:5, 4:5, and 9:16) and one keyed by the number of reference images, so that samples in the same batch produce the same number of latent tokens, which reduces padding and improves throughput.

Table 3: Data outline and training details for each training stage. Q. denotes the Query-kontext tokens and Con. denotes the Connector module.

Stage Stage 1 Stage 2 Stage 3
Task Image Generation Image Generation Instruction Editing Image Reconstruction Image Reconstruction Customized Generation Image Transformation Multi-subject
Type T
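The bucketing described in the Infrastructure paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sample` type, the function names, the nearest-ratio assignment rule, and the reading of each ratio as width:height are all assumptions made for illustration. For brevity, the two bucketeers are combined into a single composite key (aspect-ratio bucket, number of reference images).

```python
from collections import defaultdict
from typing import NamedTuple

# Aspect ratios listed in the text, read here as (width, height).
# The orientation is an assumption; the text does not specify it.
ASPECT_BUCKETS = [(1, 1), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (9, 16)]


class Sample(NamedTuple):
    """Hypothetical training sample: image size plus reference-image count."""
    width: int
    height: int
    num_refs: int


def nearest_aspect_bucket(width, height):
    """Return the listed aspect ratio closest to width / height (assumed rule)."""
    ratio = width / height
    return min(ASPECT_BUCKETS, key=lambda wh: abs(wh[0] / wh[1] - ratio))


def bucket_key(sample):
    """Composite key: samples sharing it yield equally shaped batches."""
    return (nearest_aspect_bucket(sample.width, sample.height), sample.num_refs)


def make_batches(samples, batch_size):
    """Group samples by bucket key and emit a batch whenever a bucket fills.

    Partially filled buckets left at the end are dropped in this sketch.
    """
    buckets = defaultdict(list)
    batches = []
    for sample in samples:
        key = bucket_key(sample)
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:
            batches.append(buckets.pop(key))
    return batches
```

Because every sample in a batch shares one aspect-ratio bucket and one reference-image count, resizing each image to its bucket's resolution would give all samples in the batch the same number of latent tokens, so no padding is needed within the batch.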