Introduction

Figure 1: Training loss vs. wall-clock time. EC reaches loss 3.75 in 10.6h, 2.
Diffusion language models can achieve faster convergence and improved accuracy simply by swapping token-choice routing for expert-choice routing, and further benefit from allocating more compute to early denoising steps.
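To make the distinction between the two routing schemes concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `token_choice` and `expert_choice`, the toy score matrix, and the `k` and `capacity` values are all assumptions chosen for illustration. It follows the usual mixture-of-experts definitions, where token-choice (TC) lets each token select its top-scoring experts and expert-choice (EC) lets each expert select its top-scoring tokens up to a fixed capacity.

```python
from typing import List


def token_choice(scores: List[List[float]], k: int) -> List[List[int]]:
    """Each token picks its top-k experts.

    scores[t][e] is the router score of token t for expert e.
    Returns, for each expert, the list of token indices routed to it.
    """
    num_experts = len(scores[0])
    assignment: List[List[int]] = [[] for _ in range(num_experts)]
    for t, row in enumerate(scores):
        top = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:k]
        for e in top:
            assignment[e].append(t)
    return assignment


def expert_choice(scores: List[List[float]], capacity: int) -> List[List[int]]:
    """Each expert picks its top-`capacity` tokens.

    Every expert processes exactly `capacity` tokens, so the load is
    balanced by construction (assuming at least `capacity` tokens exist).
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment: List[List[int]] = []
    for e in range(num_experts):
        top = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)[:capacity]
        assignment.append(sorted(top))
    return assignment


# Hypothetical router scores: 4 tokens, 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]

tc = token_choice(scores, k=1)
ec = expert_choice(scores, capacity=2)

print(tc)  # [[0, 1, 2], [3]]  expert 0 is overloaded
print(ec)  # [[0, 1], [2, 3]]  each expert gets exactly 2 tokens
```

In this toy case token-choice sends three of the four tokens to expert 0, while expert-choice gives every expert the same load. The sketch does not model the denoising-step compute schedule mentioned above, since the text here gives no details about it.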