Reusing prior attention computations yields 14x attention speedups and a 60% end-to-end latency reduction in long-context LLMs without sacrificing quality.
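The mechanism named in that blurb is attention reuse: keys and values computed for an already-processed context are kept around so an extended prompt only pays for its new tokens. The paper's actual method isn't described here, so the sketch below is a hypothetical single-head illustration of prefix KV caching; `AttentionCache`, `d_model`, and the projection matrices are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of attention reuse via a prefix KV cache.
# Not the paper's method; names and layout are illustrative only.
import numpy as np

d_model = 64

def attention(q, k, v):
    """Standard scaled dot-product attention (single head, no mask)."""
    scores = q @ k.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

class AttentionCache:
    """Caches K/V for a processed prefix so an extended prompt only
    projects and attends over the newly appended tokens."""
    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.k = np.empty((0, d_model))
        self.v = np.empty((0, d_model))

    def extend(self, new_tokens):
        # Project only the unseen tokens; cached prefix K/V are reused.
        self.k = np.vstack([self.k, new_tokens @ self.w_k])
        self.v = np.vstack([self.v, new_tokens @ self.w_v])
        return self.k, self.v

rng = np.random.default_rng(0)
w_k = rng.normal(size=(d_model, d_model))
w_v = rng.normal(size=(d_model, d_model))
cache = AttentionCache(w_k, w_v)

prefix = rng.normal(size=(1000, d_model))  # long context, projected once
cache.extend(prefix)

suffix = rng.normal(size=(4, d_model))     # new tokens: only these are projected
k, v = cache.extend(suffix)
out = attention(suffix, k, v)              # attend over the full context
print(out.shape)                           # (4, 64)
```

In this toy setup the prefix projections are computed once and amortized across every subsequent extension, which is where the long-context savings come from.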
Skewed communication patterns leave substantial GPU-cluster bandwidth unused; NIMBLE recovers it, unlocking up to 5.2x higher throughput by dynamically balancing traffic at runtime.
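The blurb names the general technique, runtime traffic balancing, without detailing NIMBLE's scheduler. The sketch below is a hypothetical illustration of why balancing helps under skew: it compares a static hash-based link assignment against greedy least-loaded link selection on a heavy-tailed workload. The link count, flow-size model, and function names are assumptions, not NIMBLE's design.

```python
# Hypothetical sketch of runtime traffic balancing under skewed traffic.
# Not NIMBLE's scheduler; workload and routing policies are illustrative.
import heapq
import random

NUM_LINKS = 8

def static_route(transfers):
    """Baseline: hash each flow to a fixed link, oblivious to skew."""
    loads = [0.0] * NUM_LINKS
    for flow_id, size in transfers:
        loads[hash(flow_id) % NUM_LINKS] += size
    return loads

def dynamic_route(transfers):
    """Runtime balancing: route each transfer over the least-loaded link."""
    heap = [(0.0, link) for link in range(NUM_LINKS)]
    heapq.heapify(heap)
    loads = [0.0] * NUM_LINKS
    for _, size in transfers:
        load, link = heapq.heappop(heap)
        loads[link] = load + size
        heapq.heappush(heap, (loads[link], link))
    return loads

random.seed(0)
# Skewed traffic: a few flows carry most of the bytes (heavy-tailed sizes).
transfers = [(f, random.paretovariate(1.2)) for f in range(10_000)]

# Completion time is set by the hottest link; balancing lowers that maximum.
print("static  max-link load:", round(max(static_route(transfers)), 1))
print("dynamic max-link load:", round(max(dynamic_route(transfers)), 1))
```

The point of the comparison is that with skewed flow sizes, the bottleneck link under static assignment carries far more than 1/8 of the bytes, while greedy rebalancing keeps the maximum link load, and hence the transfer completion time, close to the even share.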