Microsoft Research
One of the world's largest corporate research labs, spanning AI, systems, and human-computer interaction.
www.microsoft.com
Recent Papers
The paper introduces On-Policy Context Distillation (OPCD), a method for distilling in-context knowledge into language models by training a student model on its own generated trajectories. OPCD minimizes the reverse Kullback-Leibler divergence between the student's output distribution and that of a context-conditioned teacher model, effectively bridging on-policy and context distillation. Experiments across mathematical reasoning, text-based games, and domain-specific tasks demonstrate that OPCD outperforms baselines in task accuracy and out-of-distribution generalization, while also enabling effective cross-size distillation.
Introduces On-Policy Context Distillation (OPCD), a novel framework for language model distillation that leverages on-policy training with reverse KL divergence to internalize in-context knowledge.
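A minimal sketch of the OPCD objective, assuming HuggingFace-style causal LMs; the helper below and its shapes are our illustration of the stated idea, not the paper's exact recipe:
```python
import torch
import torch.nn.functional as F

def opcd_loss(student, teacher, context_ids, prompt_ids, max_new_tokens=64):
    """One OPCD step, sketched: the student samples a trajectory from the
    bare prompt, and we minimize per-token reverse KL against a teacher
    that sees the same prompt with the in-context knowledge prepended.
    (The gradient through the sampling distribution itself is omitted,
    as in standard on-policy distillation surrogates.)"""
    # On-policy data: the student generates its own continuation.
    with torch.no_grad():
        traj = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                do_sample=True)
    n_gen = traj.shape[1] - prompt_ids.shape[1]

    # Student log-probs over the generated positions (no context shown).
    s_logits = student(traj).logits[:, prompt_ids.shape[1] - 1:-1]
    # Teacher log-probs over the same tokens, conditioned on the context
    # (context_ids assumed shaped [1, C]).
    t_input = torch.cat([context_ids.expand(traj.size(0), -1), traj], dim=1)
    with torch.no_grad():
        t_logits = teacher(t_input).logits[:, -n_gen - 1:-1]

    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)
    # Reverse KL KL(student || teacher): mode-seeking, so the student
    # concentrates on behavior the context-conditioned teacher endorses.
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
```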
The paper introduces AdNanny, a unified reasoning-centric LLM fine-tuned from a 671B DeepSeek-R1 checkpoint for various offline advertising tasks. The authors construct reasoning-augmented corpora with structured supervision and natural language explanations, then apply multi-task supervised fine-tuning with adaptive reweighting, followed by reinforcement learning to align with online advertising objectives. Deployed in Bing Ads, AdNanny reduces manual labeling effort and improves accuracy, demonstrating a scalable and cost-effective solution that consolidates many task-specific models into one.
The paper demonstrates that a single, reasoning-centric LLM, AdNanny, can effectively replace multiple task-specific models for offline advertising tasks, leading to improved accuracy and reduced manual effort.
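The adaptive reweighting step can be pictured with a small heuristic: up-weight tasks whose smoothed recent loss is high. The class, EMA rule, and task names below are our illustration, not AdNanny's published scheme:
```python
class AdaptiveTaskWeighter:
    """Toy adaptive reweighting for multi-task SFT: tasks with higher
    smoothed recent loss (i.e., under-fit tasks) get proportionally
    more weight in the combined objective."""
    def __init__(self, tasks, momentum=0.9):
        self.ema = {t: 1.0 for t in tasks}      # smoothed per-task loss
        self.momentum = momentum

    def update(self, task, loss_value):
        m = self.momentum
        self.ema[task] = m * self.ema[task] + (1 - m) * loss_value

    def weights(self):
        total = sum(self.ema.values())
        # Normalized so the weights average to 1 across tasks.
        return {t: len(self.ema) * v / total for t, v in self.ema.items()}

# Usage in a training loop (hypothetical task names):
weighter = AdaptiveTaskWeighter(["ad_relevance", "copy_quality", "policy_check"])
weighter.update("ad_relevance", 2.3)
# total_loss = sum(weighter.weights()[t] * task_loss[t] for t in task_loss)
```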
The paper introduces Bring Your Own Language (BYOL), a framework for developing language-aware LLMs tailored to languages' digital resource availability. BYOL classifies languages into resource tiers and applies different integration pathways: a data refinement and expansion pipeline for low-resource languages (demonstrated on Chichewa and Maori), and a translation-mediated approach for extreme-low-resource languages (demonstrated on Inuktitut). Experiments show that BYOL improves performance on low-resource languages by approximately 12% compared to multilingual baselines, while maintaining English and multilingual capabilities, and enables high-accuracy LLM access for extreme-low-resource languages via improved translation.
Introduces a tiered framework, BYOL, for language-aware LLM development that tailors integration pathways based on a language's digital resource availability.
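The tiered routing idea reduces to a small decision rule; the thresholds and pathway names below are illustrative stand-ins, not BYOL's actual classification:
```python
def integration_pathway(lang: str, web_tokens: float) -> str:
    """Hypothetical router mirroring BYOL's tiered design: bucket a
    language by digital-resource availability and pick a pathway."""
    if web_tokens >= 1e9:                 # high/medium resource
        return "standard multilingual pretraining"
    if web_tokens >= 1e6:                 # low resource, e.g. Chichewa, Maori
        return "data refinement and expansion pipeline"
    # extreme-low resource, e.g. Inuktitut
    return "translation-mediated access via an improved MT model"

print(integration_pathway("Chichewa", 5e6))
# -> data refinement and expansion pipeline
```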
The paper introduces SIGMA, an open-source training stack designed to enhance the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware by addressing system disruptions, numerical errors, and the complexities of parallelism optimization. SIGMA incorporates the Lucia Training Platform (LTP), which achieved 94.45% effective cluster accelerator utilization, and the Lucia Training Framework (LTF), which trained a 200B-parameter MoE model (SIGMA-MoE) on 2,048 AI accelerators, reaching 21.08% MFU and state-of-the-art downstream accuracy. This work demonstrates a robust and cost-effective alternative to established accelerator stacks for large-scale AI training.
Introduces SIGMA, a comprehensive training stack that significantly improves the reliability, stability, and efficiency of large-scale AI training on early-life hardware.
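For context on the 21.08% figure, model FLOPs utilization (MFU) is the ratio of achieved training FLOPs to the cluster's theoretical peak. A back-of-envelope sketch, with every input number below hypothetical:
```python
def mfu(tokens_per_sec, active_params, n_devices, peak_flops_per_device):
    """MFU: achieved model FLOPs/s over theoretical peak, using the
    standard ~6N FLOPs-per-token estimate for a forward+backward pass.
    For a MoE model, N is the *active* parameter count per token,
    not the full 200B."""
    achieved = 6 * active_params * tokens_per_sec
    return achieved / (n_devices * peak_flops_per_device)

# Hypothetical inputs (~20B active params, 300 TFLOP/s peak per device,
# 2,048 devices); with these the formula lands near the reported ballpark:
print(f"{mfu(1.1e6, 2.0e10, 2048, 3.0e14):.2%}")   # ~21%
```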
The paper introduces a weighted transparency framework based on the EU AI Act and Stanford Transparency Index to evaluate AI model documentation, addressing the current fragmentation and inconsistency. They developed an automated multi-agent pipeline leveraging LLMs to extract documentation and score completeness across 50 models, revealing significant gaps, especially in safety-critical categories. The evaluation shows frontier labs achieve higher compliance (around 80%) compared to other providers (below 60%), highlighting areas for improvement in AI transparency.
Introduces a novel weighted transparency framework and automated evaluation pipeline to systematically assess and score the completeness of AI model documentation.
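The scoring reduces to a weighted average of per-category completeness; the category names and weights below are illustrative, not the paper's exact EU AI Act / Stanford Transparency Index-derived taxonomy:
```python
# Illustrative category weights (must sum to 1).
WEIGHTS = {
    "data_provenance":      0.25,
    "safety_evaluations":   0.30,   # safety-critical categories weigh more
    "compute_and_methods":  0.20,
    "usage_policies":       0.15,
    "environmental_impact": 0.10,
}

def transparency_score(completeness: dict) -> float:
    """completeness maps each category to the 0..1 fraction of required
    documentation items the extraction pipeline found."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * completeness.get(cat, 0.0) for cat, w in WEIGHTS.items())

print(transparency_score({"data_provenance": 0.9, "safety_evaluations": 0.5,
                          "compute_and_methods": 0.8, "usage_policies": 1.0,
                          "environmental_impact": 0.2}))   # 0.705
```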
This paper investigates the implementation of zero-data-retention policies in enterprise AI assistants built on LLMs, focusing on the architectural, compliance, and usability trade-offs. It analyzes the technical architectures of Salesforce AgentForce and Microsoft Copilot, two leading AI assistants, to understand how each achieves zero data retention. The study finds that the two systems take distinct architectural approaches, and highlights the challenges and solutions involved in balancing data privacy with usability.
Analyzes the architectural and deployment strategies of Salesforce AgentForce and Microsoft Copilot to achieve zero data retention, revealing the trade-offs between compliance, usability, and technical architecture.
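The common core of such architectures is a stateless request path that persists only non-reconstructable metadata. A minimal sketch, with names that are ours rather than either vendor's API:
```python
import hashlib

def handle_request(prompt: str, llm_call, audit_log: list) -> str:
    """Serve one request with zero retention: the completion goes back to
    the caller, and only non-reconstructable metadata is persisted for
    compliance auditing."""
    response = llm_call(prompt)              # inference in memory only
    audit_log.append({
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_chars": len(response),     # size, never content
    })
    return response  # prompt/response leave no persistent trace here

log = []
print(handle_request("draft a follow-up email", lambda p: "Dear ...", log))
```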
The paper introduces LiveMCP-101, a benchmark of 101 real-world queries designed to stress test AI agents' ability to solve multi-step tasks using diverse MCP tools. It addresses the gap in evaluating AI agents' effectiveness in dynamic scenarios by requiring coordinated use of tools like web search, file operations, and data analysis. The benchmark reveals that even state-of-the-art LLMs struggle, achieving success rates below 60%, and highlights inefficiencies in token usage and tool orchestration.
Introduces LiveMCP-101, a challenging benchmark with ground-truth execution plans, to rigorously evaluate and diagnose the performance of MCP-enabled agents in realistic, multi-step tool-use scenarios.
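Evaluation against ground-truth execution plans can be sketched as ordered plan matching; the benchmark's actual judging is richer than this toy scorer:
```python
def plan_match_score(agent_calls, reference_plan):
    """Toy scorer in the spirit of LiveMCP-101's ground-truth execution
    plans: credit the agent for each reference tool call it reproduces
    in order (matching tool name plus required arguments)."""
    i = 0
    for call in agent_calls:
        if i == len(reference_plan):
            break
        ref = reference_plan[i]
        args_ok = all(call.get("args", {}).get(k) == v
                      for k, v in ref["args"].items())
        if call["tool"] == ref["tool"] and args_ok:
            i += 1
    return i / len(reference_plan)

plan = [{"tool": "web_search", "args": {"query": "quarterly CPI data"}},
        {"tool": "write_file", "args": {"path": "report.md"}}]
run  = [{"tool": "web_search", "args": {"query": "quarterly CPI data"}},
        {"tool": "read_file",  "args": {"path": "notes.txt"}}]
print(plan_match_score(run, plan))   # 0.5
```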
The authors introduce AtomWorks, a data framework designed to streamline the development of biomolecular foundation models for tasks like structure prediction and protein design. Using AtomWorks, they trained RosettaFold-3 (RF3), a structure prediction network that improves chirality handling, leading to performance closer to AlphaFold3. The release of AtomWorks, training data, and RF3 model weights under a BSD license aims to accelerate open-source biomolecular machine learning research.
Introduces AtomWorks, a comprehensive data framework, and leverages it to train RF3, a structure prediction network with enhanced chirality treatment, bridging the performance gap with closed-source models.
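Chirality handling ultimately rests on a geometric invariant: the signed volume spanned by a stereocenter's substituents. A sketch of that check (the underlying idea, not RF3's actual training objective):
```python
import numpy as np

def chirality_sign(center, a, b, c):
    """Sign of the volume spanned by three substituents around a
    stereocenter (taken in a fixed priority order); it flips between
    mirror-image configurations."""
    m = np.stack([a - center, b - center, c - center])
    return np.sign(np.linalg.det(m))

# Mirroring a structure (z -> -z) inverts every stereocenter:
pts = [np.array(p, float) for p in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]]
mirror = [p * np.array([1.0, 1.0, -1.0]) for p in pts]
assert chirality_sign(*pts) == -chirality_sign(*mirror)
```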
This paper addresses the underexplored NLP tasks of structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations, which are critical for reducing healthcare provider documentation burden. The authors evaluate the performance of both open- and closed-weight LLMs on these tasks using private and newly released open-source datasets (SYNUR and SIMORD). They also propose an agentic pipeline for generating realistic, non-sensitive nurse dictations to facilitate structured extraction of clinical observations.
Introduces SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction, respectively, and evaluates LLM performance on these tasks.
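Structured extraction of this kind is typically prompted against a fixed schema; the schema and prompt below are hypothetical illustrations, not the released SYNUR format:
```python
import json

# Hypothetical target schema for nurse-observation extraction.
SCHEMA = {"vitals": {"heart_rate_bpm": None, "temperature_c": None},
          "observations": []}

def extraction_prompt(dictation: str) -> str:
    return ("Extract the clinical observations from the dictation below "
            f"into JSON matching this schema:\n{json.dumps(SCHEMA)}\n"
            "Use null for anything not mentioned. Output JSON only.\n\n"
            f"Dictation: {dictation}")

def parse_response(raw: str) -> dict:
    return json.loads(raw)   # validate against the schema in real use

print(extraction_prompt("Patient resting, pulse 72, afebrile."))
```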
The paper introduces UNIVERSE, a VLM-based evaluator that addresses the challenge of evaluating video world model rollouts by assessing action alignment and semantic consistency. The authors adapt VLMs under data and compute constraints using full, partial, and parameter-efficient fine-tuning across various task formats and environments. The resulting UNIVERSE evaluator achieves parity with task-specific checkpoints and demonstrates strong alignment with human judgments in action and character recognition tasks.
Introduces UNIVERSE, a unified VLM-based evaluator, that effectively assesses video world model rollouts by adapting to fine-grained, temporally sensitive evaluation tasks under data and compute constraints.
This paper introduces an AI-enabled smart tutor leveraging large language models (LLMs) to provide homework assessment and feedback for undergraduate circuit analysis students. The tutor, deployed on Microsoft Azure, uses carefully crafted prompts to optimize open-ended question answering and feedback generation. Preliminary evaluation based on student feedback shows 90.9% satisfaction, and analysis of usage data identifies common student difficulties for instructors.
Demonstrates the feasibility and positive reception of an LLM-enhanced smart tutor for circuit analysis, highlighting its potential for personalized instruction and real-time feedback.
The authors introduce Aurora, a foundation model for Earth system forecasting trained on over one million hours of diverse geophysical data. Aurora achieves state-of-the-art performance in predicting air quality, ocean waves, tropical cyclone tracks, and high-resolution weather. Critically, Aurora attains these results at orders of magnitude lower computational cost compared to traditional numerical weather prediction models, demonstrating the potential of AI for democratizing access to accurate environmental forecasts.
Demonstrates a single foundation model, Aurora, can outperform operational forecasting systems across diverse Earth system prediction tasks while drastically reducing computational cost.
The paper introduces BitNet b1.58 2B4T, a 2-billion-parameter, 1-bit large language model trained on 4 trillion tokens. It demonstrates performance comparable to full-precision LLMs of similar size across diverse benchmarks, while delivering a significant improvement in computational efficiency: reduced memory footprint, energy consumption, and decoding latency.
Demonstrates the feasibility of training a 2B parameter 1-bit LLM that matches the performance of full-precision models while drastically reducing computational costs.
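The core of the b1.58 scheme is absmean weight quantization to {-1, 0, +1}; a sketch of that published function (training-time details such as the straight-through estimator and activation quantization are omitted):
```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """BitNet b1.58-style weight quantization: scale by the mean absolute
    value, then round-and-clip each weight to {-1, 0, +1} (~1.58 bits
    per weight)."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale               # effective weight is w_q * scale

w = torch.randn(4, 4)
w_q, s = absmean_ternary(w)
print(w_q)                          # entries are only -1, 0, or +1
print((w_q * s - w).abs().mean())   # mean quantization error
```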
The paper introduces Magnet, a framework for synthesizing multi-turn tool-use training data for LLMs by automatically translating function signature paths into query and executable function call sequences. It models function interactions as a graph and uses node operations to build signature paths, then employs context distillation with positive (reference function calls) and negative hints (incorrect function calls) to guide trajectory generation. Fine-tuning a 14B model with Magnet-synthesized data achieves state-of-the-art performance on BFCL-v3 and ToolQuery, outperforming Gemini-1.5-pro-002.
Introduces a novel graph-based framework, Magnet, for synthesizing high-quality, multi-turn tool-use training data by translating function signature paths into realistic query-function call sequences.
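The graph construction can be pictured with a toy signature graph and path sampler; the functions and edge rule below are ours, not Magnet's:
```python
import random

# Toy signature graph: nodes are function signatures, an edge means one
# function's output can feed another's input.
GRAPH = {
    "search_flights(dest) -> flights": ["book_flight(flights) -> booking"],
    "book_flight(flights) -> booking": ["email_receipt(booking) -> none"],
    "email_receipt(booking) -> none": [],
}

def sample_signature_path(graph, start, max_len=4):
    """Random walk over the graph; the resulting signature path is the
    skeleton that a Magnet-style pipeline translates into a user query
    plus an executable multi-turn function-call sequence."""
    path, node = [start], start
    while graph[node] and len(path) < max_len:
        node = random.choice(graph[node])
        path.append(node)
    return path

print(sample_signature_path(GRAPH, "search_flights(dest) -> flights"))
```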
The paper introduces Magma, a foundation model extending vision-language models with spatial-temporal intelligence for multimodal AI agents. Magma is trained on heterogeneous datasets of images, videos, and robotics data, using Set-of-Mark (SoM) for action grounding and Trace-of-Mark (ToM) for action planning. Experiments demonstrate that SoM and ToM enhance spatio-temporal intelligence, enabling Magma to achieve state-of-the-art results in UI navigation and robotic manipulation while maintaining strong multimodal understanding.
Introduces Set-of-Mark (SoM) and Trace-of-Mark (ToM) as novel methods for grounding actions and planning in visual-spatial environments, significantly improving agentic capabilities in multimodal AI models.
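Set-of-Mark prompting itself is simple to picture: number the candidate regions so the model can ground an action by mark index instead of raw pixels. A sketch using PIL, with rendering details that are ours rather than Magma's exact pipeline:
```python
from PIL import Image, ImageDraw

def draw_set_of_mark(image, boxes):
    """Overlay numbered marks on candidate regions so a model can ground
    an action as 'click mark 2' rather than raw pixel coordinates."""
    draw = ImageDraw.Draw(image)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")
    return image

marked = draw_set_of_mark(Image.new("RGB", (200, 120), "white"),
                          [(10, 10, 90, 50), (110, 60, 190, 110)])
marked.save("som_example.png")
```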

