FlashCP achieves up to 1.63x faster training for large language models by eliminating redundant communication and optimizing workload balance.
SimSD achieves a 7.46x increase in decoding throughput for diffusion language models without sacrificing generation quality.
Forget GPU-centric designs: AMMA slashes attention latency by 15x and energy consumption by 7x with a memory-centric architecture for long-context LLMs.