15 papers from Microsoft Research on Inference & Quantization
LLM agents can slash task completion time by almost 50% simply by predicting and pre-executing likely tool calls.
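As a rough illustration of the pre-execution idea (all names below are hypothetical, not the paper's API): launch the predicted tool call in the background while the model is still deciding, and reuse the result only if the prediction is confirmed.

from concurrent.futures import ThreadPoolExecutor

def agent_step(llm, tools, history, predict_next_tool_call):
    # Speculatively launch the most likely next tool call while the LLM
    # is still producing its own decision; reuse the result only on a hit.
    with ThreadPoolExecutor(max_workers=1) as pool:
        guess_name, guess_args = predict_next_tool_call(history)
        speculative = pool.submit(tools[guess_name], **guess_args)

        name, args = llm.decide_tool_call(history)   # the "real" decision

        if (name, args) == (guess_name, guess_args):
            return speculative.result()               # prediction hit: latency hidden
        speculative.cancel()                          # miss: pay the normal cost
        return tools[name](**args)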
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Unlock 33% faster LLM inference on commodity GPUs with SlideSparse, which finally brings hardware-accelerated (2N-2):2N sparsity to the masses, bridging the accuracy gap left by NVIDIA's strict 2:4 pruning.
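For intuition, an N:M pattern keeps the N largest-magnitude weights in every group of M, so 2:4 keeps half of each group while a (2N-2):2N pattern such as 6:8 prunes far less aggressively. A toy magnitude-pruning sketch (illustrative only, not SlideSparse's kernel):

import torch

def nm_mask(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    # Keep the n largest-magnitude entries in every group of m along the last dim.
    # n=2, m=4 is NVIDIA-style 2:4 sparsity; n=6, m=8 is a (2N-2):2N pattern.
    rows, cols = weight.shape
    groups = weight.abs().reshape(rows, cols // m, m)
    keep = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(rows, cols)

w = torch.randn(128, 256)
w_68 = w * nm_mask(w, n=6, m=8)   # keeps 6 of every 8 weights, in hardware-friendly groups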
1.58-bit LLMs are surprisingly more resilient to sparsity than their full-precision counterparts, opening new avenues for extreme compression.
Save 20% on LLM costs with <2% accuracy drop by strategically cascading a small model with a large one, guided by a confidence-calibrated SLM.
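The cascade pattern itself is simple; here is a minimal sketch assuming the small model exposes a calibrated confidence score (the threshold value and the (answer, confidence) interface are placeholders, not the paper's calibration method):

def cascade_answer(prompt, small_lm, large_lm, threshold=0.9):
    # Try the cheap model first; escalate to the large model only when the
    # small model's calibrated confidence falls below the threshold.
    answer, confidence = small_lm(prompt)
    if confidence >= threshold:
        return answer
    answer, _ = large_lm(prompt)
    return answer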
Forget same-family constraints: you can compress prompts for LLaMA with a Qwen draft model and still get 90-100% of the original performance.
Speculative decoding gets a throughput boost of up to 4.32x by using reinforcement learning to dynamically balance drafting and verification.
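The RL scheduling is the paper's contribution; for context, a bare-bones draft-then-verify step looks like the sketch below, where the number of drafted tokens k is exactly the knob such a policy would tune (greedy acceptance, single 1-D token sequence, hypothetical draft_lm/target_lm callables returning logits):

import torch

def speculative_step(draft_lm, target_lm, tokens, k=4):
    # Cheap draft model proposes k tokens autoregressively.
    draft = tokens.clone()
    for _ in range(k):
        nxt = draft_lm(draft)[..., -1, :].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)

    # Large target model verifies all k drafted tokens in one forward pass.
    target_preds = target_lm(draft)[..., -(k + 1):-1, :].argmax(-1)
    drafted = draft[..., -k:]
    accepted = (target_preds == drafted).long().cumprod(-1).sum().item()
    return torch.cat([tokens, drafted[..., :accepted]], dim=-1), accepted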
Achieve up to 57% better point cloud compression by combining the generalization of pretrained models with the robustness of implicit neural representations.
Forget full-cache rollouts: this parameter-efficient fine-tuning method lets large reasoning models maintain accuracy while slashing memory usage during RL training.
Scaling laws hold for interest modeling: bigger LLMs and more inference-time sampling consistently boost news recommendation quality, and these gains can be distilled into smaller, deployable models.
Language models can now internalize experiential knowledge and system prompts more effectively through on-policy context distillation, leading to better task accuracy and out-of-distribution generalization.
By explicitly detecting and escaping "Forbidden Zones" during training, AMD unlocks significant gains in sample fidelity and training robustness for few-step generative models like SDXL.
Forget hand-annotated data: Magnet distills multi-turn tool-use skills into LLMs by automatically generating training trajectories, yielding models that outperform even Gemini 1.5 Pro.