Search papers, labs, and topics across Lattice.
Intel
2
0
6
Splitting attention and feedforward networks onto separate GPUs can unlock 4x higher MoE LLM throughput, but only if you carefully tune the GPU partitioning strategy based on the workload.
Agentic workloads aren't just long prompts; they're decode-bound beasts with a tool-use personality arc, demanding a rethink of LLM serving infrastructure.