Shanghai Jiao Tong University, University of Connecticut
Forget agonizing over checkpointing and restarts: LiveR slashes LLM training downtime from minutes to seconds by hot-swapping model state between parallel training worlds.
Hopper's idle NVLink Copy Engine can be turned into a nearly free communication channel for MoE load balancing, slashing token stragglers by up to 70% without impacting existing parallelism.