15 papers from Microsoft Research on Inference & Quantization
LLM agents can slash task completion time by almost 50% simply by predicting and pre-executing likely tool calls.
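As a rough illustration of the pre-execution idea (all names below are hypothetical, not the paper's API): launch the predicted tool call in the background while the model is still deciding, and reuse the result only if the prediction is confirmed.

from concurrent.futures import ThreadPoolExecutor

def agent_step(llm, tools, history, predict_next_tool_call):
    # Speculatively launch the most likely next tool call while the LLM
    # is still producing its own decision; reuse the result only on a hit.
    with ThreadPoolExecutor(max_workers=1) as pool:
        guess_name, guess_args = predict_next_tool_call(history)
        speculative = pool.submit(tools[guess_name], **guess_args)

        name, args = llm.decide_tool_call(history)   # the "real" decision

        if (name, args) == (guess_name, guess_args):
            return speculative.result()               # prediction hit: latency hidden
        speculative.cancel()                          # miss: pay the normal cost
        return tools[name](**args)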
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Unlock 33% faster LLM inference on commodity GPUs with SlideSparse, which finally brings hardware-accelerated (2N-2):2N sparsity to the masses, bridging the accuracy gap left by NVIDIA's strict 2:4 pruning.
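For intuition, an N:M pattern keeps the N largest-magnitude weights in every group of M, so 2:4 keeps half of each group while a (2N-2):2N pattern such as 6:8 prunes far less aggressively. A toy magnitude-pruning sketch (illustrative only, not SlideSparse's kernel):

import torch

def nm_mask(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    # Keep the n largest-magnitude entries in every group of m along the last dim.
    # n=2, m=4 is NVIDIA-style 2:4 sparsity; n=6, m=8 is a (2N-2):2N pattern.
    rows, cols = weight.shape
    groups = weight.abs().reshape(rows, cols // m, m)
    keep = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(rows, cols)

w = torch.randn(128, 256)
w_68 = w * nm_mask(w, n=6, m=8)   # keeps 6 of every 8 weights, in hardware-friendly groups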
1.58-bit LLMs are surprisingly more resilient to sparsity than their full-precision counterparts, opening new avenues for extreme compression.
Save 20% on LLM costs with <2% accuracy drop by strategically cascading a small model with a large one, guided by a confidence-calibrated SLM.
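The cascade pattern itself is simple; here is a minimal sketch assuming the small model exposes a calibrated confidence score (the threshold value and the (answer, confidence) interface are placeholders, not the paper's calibration method):

def cascade_answer(prompt, small_lm, large_lm, threshold=0.9):
    # Try the cheap model first; escalate to the large model only when the
    # small model's calibrated confidence falls below the threshold.
    answer, confidence = small_lm(prompt)
    if confidence >= threshold:
        return answer
    answer, _ = large_lm(prompt)
    return answer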
Forget same-family constraints: you can compress prompts for LLaMA with a Qwen draft model and still get 90-100% of the original performance.
Speculative decoding gets a throughput boost of up to 4.32x by using reinforcement learning to dynamically balance drafting and verification.
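The RL scheduling is the paper's contribution; for context, a bare-bones draft-then-verify step looks like the sketch below, where the number of drafted tokens k is exactly the knob such a policy would tune (greedy acceptance, single 1-D token sequence, hypothetical draft_lm/target_lm callables returning logits):

import torch

def speculative_step(draft_lm, target_lm, tokens, k=4):
    # Cheap draft model proposes k tokens autoregressively.
    draft = tokens.clone()
    for _ in range(k):
        nxt = draft_lm(draft)[..., -1, :].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)

    # Large target model verifies all k drafted tokens in one forward pass.
    target_preds = target_lm(draft)[..., -(k + 1):-1, :].argmax(-1)
    drafted = draft[..., -k:]
    accepted = (target_preds == drafted).long().cumprod(-1).sum().item()
    return torch.cat([tokens, drafted[..., :accepted]], dim=-1), accepted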
Achieve up to 57% better point cloud compression by combining the generalization of pretrained models with the robustness of implicit neural representations.
Forget full-cache rollouts: this parameter-efficient fine-tuning method lets large reasoning models maintain accuracy while slashing memory usage during RL training.
Scaling laws hold for interest modeling: bigger LLMs and more inference-time sampling consistently boost news recommendation quality, and these gains can be distilled into smaller, deployable models.
Language models can now internalize experiential knowledge and system prompts more effectively through on-policy context distillation, leading to better task accuracy and out-of-distribution generalization.
By explicitly detecting and escaping "Forbidden Zones" during training, AMD unlocks significant gains in sample fidelity and training robustness for few-step generative models like SDXL.
Forget hand-annotated data: Magnet distills multi-turn tool-use skills into LLMs by automatically generating training trajectories, yielding models that outperform even Gemini 1.5 Pro.