Microsoft Research
One of the world's largest corporate research labs, spanning AI, systems, and human-computer interaction.
www.microsoft.com
Recent Papers
The paper introduces On-Policy Context Distillation (OPCD), a method for distilling in-context knowledge into language models by training a student model on its own generated trajectories. OPCD minimizes the reverse Kullback-Leibler divergence between the student's output distribution and that of a context-conditioned teacher model, effectively bridging on-policy and context distillation. Experiments across mathematical reasoning, text-based games, and domain-specific tasks demonstrate that OPCD outperforms baselines in task accuracy and out-of-distribution generalization, while also enabling effective cross-size distillation.
Introduces On-Policy Context Distillation (OPCD), a novel framework for language model distillation that leverages on-policy training with reverse KL divergence to internalize in-context knowledge.
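A minimal sketch of the OPCD objective, assuming HuggingFace-style causal LMs; the helper below and its shapes are our illustration of the stated idea, not the paper's exact recipe:
```python
import torch
import torch.nn.functional as F

def opcd_loss(student, teacher, context_ids, prompt_ids, max_new_tokens=64):
    """One OPCD step, sketched: the student samples a trajectory from the
    bare prompt, and we minimize per-token reverse KL against a teacher
    that sees the same prompt with the in-context knowledge prepended.
    (The gradient through the sampling distribution itself is omitted,
    as in standard on-policy distillation surrogates.)"""
    # On-policy data: the student generates its own continuation.
    with torch.no_grad():
        traj = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                do_sample=True)
    n_gen = traj.shape[1] - prompt_ids.shape[1]

    # Student log-probs over the generated positions (no context shown).
    s_logits = student(traj).logits[:, prompt_ids.shape[1] - 1:-1]
    # Teacher log-probs over the same tokens, conditioned on the context
    # (context_ids assumed shaped [1, C]).
    t_input = torch.cat([context_ids.expand(traj.size(0), -1), traj], dim=1)
    with torch.no_grad():
        t_logits = teacher(t_input).logits[:, -n_gen - 1:-1]

    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)
    # Reverse KL KL(student || teacher): mode-seeking, so the student
    # concentrates on behavior the context-conditioned teacher endorses.
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
```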
The paper introduces AdNanny, a unified reasoning-centric LLM fine-tuned from a 671B DeepSeek-R1 checkpoint for various offline advertising tasks. The authors construct reasoning-augmented corpora with structured supervision and natural language explanations, then apply multi-task supervised fine-tuning with adaptive reweighting, followed by reinforcement learning to align with online advertising objectives. Deployed in Bing Ads, AdNanny reduces manual labeling effort and improves accuracy, demonstrating a scalable and cost-effective solution that consolidates many task-specific models into one.
The paper demonstrates that a single, reasoning-centric LLM, AdNanny, can effectively replace multiple task-specific models for offline advertising tasks, leading to improved accuracy and reduced manual effort.
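The adaptive reweighting step can be pictured with a small heuristic: up-weight tasks whose smoothed recent loss is high. The class, EMA rule, and task names below are our illustration, not AdNanny's published scheme:
```python
class AdaptiveTaskWeighter:
    """Toy adaptive reweighting for multi-task SFT: tasks with higher
    smoothed recent loss (i.e., under-fit tasks) get proportionally
    more weight in the combined objective."""
    def __init__(self, tasks, momentum=0.9):
        self.ema = {t: 1.0 for t in tasks}      # smoothed per-task loss
        self.momentum = momentum

    def update(self, task, loss_value):
        m = self.momentum
        self.ema[task] = m * self.ema[task] + (1 - m) * loss_value

    def weights(self):
        total = sum(self.ema.values())
        # Normalized so the weights average to 1 across tasks.
        return {t: len(self.ema) * v / total for t, v in self.ema.items()}

# Usage in a training loop (hypothetical task names):
weighter = AdaptiveTaskWeighter(["ad_relevance", "copy_quality", "policy_check"])
weighter.update("ad_relevance", 2.3)
# total_loss = sum(weighter.weights()[t] * task_loss[t] for t in task_loss)
```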
The paper introduces Bring Your Own Language (BYOL), a framework for developing language-aware LLMs tailored to languages' digital resource availability. BYOL classifies languages into resource tiers and applies different integration pathways: a data refinement and expansion pipeline for low-resource languages (demonstrated on Chichewa and Maori), and a translation-mediated approach for extreme-low-resource languages (demonstrated on Inuktitut). Experiments show that BYOL improves performance on low-resource languages by approximately 12% compared to multilingual baselines, while maintaining English and multilingual capabilities, and enables high-accuracy LLM access for extreme-low-resource languages via improved translation.
Introduces a tiered framework, BYOL, for language-aware LLM development that tailors integration pathways based on a language's digital resource availability.
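The tiered routing idea reduces to a small decision rule; the thresholds and pathway names below are illustrative stand-ins, not BYOL's actual classification:
```python
def integration_pathway(lang: str, web_tokens: float) -> str:
    """Hypothetical router mirroring BYOL's tiered design: bucket a
    language by digital-resource availability and pick a pathway."""
    if web_tokens >= 1e9:                 # high/medium resource
        return "standard multilingual pretraining"
    if web_tokens >= 1e6:                 # low resource, e.g. Chichewa, Maori
        return "data refinement and expansion pipeline"
    # extreme-low resource, e.g. Inuktitut
    return "translation-mediated access via an improved MT model"

print(integration_pathway("Chichewa", 5e6))
# -> data refinement and expansion pipeline
```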
The paper introduces SIGMA, an open-source training stack designed to enhance the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware by addressing system disruptions, numerical errors, and the complexities of parallelism optimization. SIGMA incorporates the Lucia Training Platform (LTP), which achieved 94.45% effective cluster accelerator utilization, and the Lucia Training Framework (LTF), which trained a 200B-parameter MoE model (SIGMA-MoE) on 2,048 AI accelerators, reaching 21.08% MFU and state-of-the-art downstream accuracy. This work demonstrates a robust and cost-effective alternative to established accelerator stacks for large-scale AI training.
Introduces SIGMA, a comprehensive training stack that significantly improves the reliability, stability, and efficiency of large-scale AI training on early-life hardware.
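For context on the 21.08% figure, model FLOPs utilization (MFU) is the ratio of achieved training FLOPs to the cluster's theoretical peak. A back-of-envelope sketch, with every input number below hypothetical:
```python
def mfu(tokens_per_sec, active_params, n_devices, peak_flops_per_device):
    """MFU: achieved model FLOPs/s over theoretical peak, using the
    standard ~6N FLOPs-per-token estimate for a forward+backward pass.
    For a MoE model, N is the *active* parameter count per token,
    not the full 200B."""
    achieved = 6 * active_params * tokens_per_sec
    return achieved / (n_devices * peak_flops_per_device)

# Hypothetical inputs (~20B active params, 300 TFLOP/s peak per device,
# 2,048 devices); with these the formula lands near the reported ballpark:
print(f"{mfu(1.1e6, 2.0e10, 2048, 3.0e14):.2%}")   # ~21%
```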
The paper introduces a weighted transparency framework based on the EU AI Act and Stanford Transparency Index to evaluate AI model documentation, addressing the current fragmentation and inconsistency. They developed an automated multi-agent pipeline leveraging LLMs to extract documentation and score completeness across 50 models, revealing significant gaps, especially in safety-critical categories. The evaluation shows frontier labs achieve higher compliance (around 80%) compared to other providers (below 60%), highlighting areas for improvement in AI transparency.
Introduces a novel weighted transparency framework and automated evaluation pipeline to systematically assess and score the completeness of AI model documentation.
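The scoring reduces to a weighted average of per-category completeness; the category names and weights below are illustrative, not the paper's exact EU AI Act / Stanford Transparency Index-derived taxonomy:
```python
# Illustrative category weights (must sum to 1).
WEIGHTS = {
    "data_provenance":      0.25,
    "safety_evaluations":   0.30,   # safety-critical categories weigh more
    "compute_and_methods":  0.20,
    "usage_policies":       0.15,
    "environmental_impact": 0.10,
}

def transparency_score(completeness: dict) -> float:
    """completeness maps each category to the 0..1 fraction of required
    documentation items the extraction pipeline found."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * completeness.get(cat, 0.0) for cat, w in WEIGHTS.items())

print(transparency_score({"data_provenance": 0.9, "safety_evaluations": 0.5,
                          "compute_and_methods": 0.8, "usage_policies": 1.0,
                          "environmental_impact": 0.2}))   # 0.705
```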
This paper investigates the implementation of zero-data-retention policies in enterprise AI assistants built on LLMs, focusing on the architectural, compliance, and usability trade-offs. It analyzes the technical architectures of Salesforce AgentForce and Microsoft Copilot, two leading AI assistants, to understand how each achieves zero data retention. The study finds that the two systems take distinct architectural approaches, and highlights the challenges and solutions involved in balancing data privacy with usability.
Analyzes the architectural and deployment strategies of Salesforce AgentForce and Microsoft Copilot to achieve zero data retention, revealing the trade-offs between compliance, usability, and technical architecture.
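The common core of such architectures is a stateless request path that persists only non-reconstructable metadata. A minimal sketch, with names that are ours rather than either vendor's API:
```python
import hashlib

def handle_request(prompt: str, llm_call, audit_log: list) -> str:
    """Serve one request with zero retention: the completion goes back to
    the caller, and only non-reconstructable metadata is persisted for
    compliance auditing."""
    response = llm_call(prompt)              # inference in memory only
    audit_log.append({
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_chars": len(response),     # size, never content
    })
    return response  # prompt/response leave no persistent trace here

log = []
print(handle_request("draft a follow-up email", lambda p: "Dear ...", log))
```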
The paper introduces LiveMCP-101, a benchmark of 101 real-world queries designed to stress test AI agents' ability to solve multi-step tasks using diverse MCP tools. It addresses the gap in evaluating AI agents' effectiveness in dynamic scenarios by requiring coordinated use of tools like web search, file operations, and data analysis. The benchmark reveals that even state-of-the-art LLMs struggle, achieving success rates below 60%, and highlights inefficiencies in token usage and tool orchestration.
Introduces LiveMCP-101, a challenging benchmark with ground-truth execution plans, to rigorously evaluate and diagnose the performance of MCP-enabled agents in realistic, multi-step tool-use scenarios.
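Evaluation against ground-truth execution plans can be sketched as ordered plan matching; the benchmark's actual judging is richer than this toy scorer:
```python
def plan_match_score(agent_calls, reference_plan):
    """Toy scorer in the spirit of LiveMCP-101's ground-truth execution
    plans: credit the agent for each reference tool call it reproduces
    in order (matching tool name plus required arguments)."""
    i = 0
    for call in agent_calls:
        if i == len(reference_plan):
            break
        ref = reference_plan[i]
        args_ok = all(call.get("args", {}).get(k) == v
                      for k, v in ref["args"].items())
        if call["tool"] == ref["tool"] and args_ok:
            i += 1
    return i / len(reference_plan)

plan = [{"tool": "web_search", "args": {"query": "quarterly CPI data"}},
        {"tool": "write_file", "args": {"path": "report.md"}}]
run  = [{"tool": "web_search", "args": {"query": "quarterly CPI data"}},
        {"tool": "read_file",  "args": {"path": "notes.txt"}}]
print(plan_match_score(run, plan))   # 0.5
```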
The authors introduce AtomWorks, a data framework designed to streamline the development of biomolecular foundation models for tasks like structure prediction and protein design. Using AtomWorks, they trained RosettaFold-3 (RF3), a structure prediction network that improves chirality handling, leading to performance closer to AlphaFold3. The release of AtomWorks, training data, and RF3 model weights under a BSD license aims to accelerate open-source biomolecular machine learning research.
Introduces AtomWorks, a comprehensive data framework, and leverages it to train RF3, a structure prediction network with enhanced chirality treatment, bridging the performance gap with closed-source models.
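Chirality handling ultimately rests on a geometric invariant: the signed volume spanned by a stereocenter's substituents. A sketch of that check (the underlying idea, not RF3's actual training objective):
```python
import numpy as np

def chirality_sign(center, a, b, c):
    """Sign of the volume spanned by three substituents around a
    stereocenter (taken in a fixed priority order); it flips between
    mirror-image configurations."""
    m = np.stack([a - center, b - center, c - center])
    return np.sign(np.linalg.det(m))

# Mirroring a structure (z -> -z) inverts every stereocenter:
pts = [np.array(p, float) for p in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]]
mirror = [p * np.array([1.0, 1.0, -1.0]) for p in pts]
assert chirality_sign(*pts) == -chirality_sign(*mirror)
```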
This paper addresses the underexplored NLP tasks of structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations, which are critical for reducing healthcare provider documentation burden. The authors evaluate the performance of both open- and closed-weight LLMs on these tasks using private and newly released open-source datasets (SYNUR and SIMORD). They also propose an agentic pipeline for generating realistic, non-sensitive nurse dictations to facilitate structured extraction of clinical observations.
Introduces SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction, respectively, and evaluates LLM performance on these tasks.
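Structured extraction of this kind is typically prompted against a fixed schema; the schema and prompt below are hypothetical illustrations, not the released SYNUR format:
```python
import json

# Hypothetical target schema for nurse-observation extraction.
SCHEMA = {"vitals": {"heart_rate_bpm": None, "temperature_c": None},
          "observations": []}

def extraction_prompt(dictation: str) -> str:
    return ("Extract the clinical observations from the dictation below "
            f"into JSON matching this schema:\n{json.dumps(SCHEMA)}\n"
            "Use null for anything not mentioned. Output JSON only.\n\n"
            f"Dictation: {dictation}")

def parse_response(raw: str) -> dict:
    return json.loads(raw)   # validate against the schema in real use

print(extraction_prompt("Patient resting, pulse 72, afebrile."))
```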
The paper introduces UNIVERSE, a VLM-based evaluator that addresses the challenge of evaluating video world model rollouts by assessing action alignment and semantic consistency. The authors adapt VLMs under data and compute constraints using full, partial, and parameter-efficient fine-tuning across various task formats and environments. The resulting UNIVERSE evaluator achieves parity with task-specific checkpoints and demonstrates strong alignment with human judgments in action and character recognition tasks.
Introduces UNIVERSE, a unified VLM-based evaluator, that effectively assesses video world model rollouts by adapting to fine-grained, temporally sensitive evaluation tasks under data and compute constraints.
This paper introduces an AI-enabled smart tutor leveraging large language models (LLMs) to provide homework assessment and feedback for undergraduate circuit analysis students. The tutor, deployed on Microsoft Azure, uses carefully crafted prompts to optimize open-ended question answering and feedback generation. Preliminary evaluation based on student feedback shows 90.9% satisfaction, and analysis of usage data identifies common student difficulties for instructors.
Demonstrates the feasibility and positive reception of an LLM-enhanced smart tutor for circuit analysis, highlighting its potential for personalized instruction and real-time feedback.
The authors introduce Aurora, a foundation model for Earth system forecasting trained on over one million hours of diverse geophysical data. Aurora achieves state-of-the-art performance in predicting air quality, ocean waves, tropical cyclone tracks, and high-resolution weather. Critically, Aurora attains these results at orders of magnitude lower computational cost compared to traditional numerical weather prediction models, demonstrating the potential of AI for democratizing access to accurate environmental forecasts.
Demonstrates a single foundation model, Aurora, can outperform operational forecasting systems across diverse Earth system prediction tasks while drastically reducing computational cost.
The paper introduces BitNet b1.58 2B4T, a 2-billion-parameter, 1-bit large language model trained on 4 trillion tokens. It demonstrates performance comparable to full-precision LLMs of similar size across diverse benchmarks, while delivering a significant improvement in computational efficiency: reduced memory footprint, energy consumption, and decoding latency.
Demonstrates the feasibility of training a 2B parameter 1-bit LLM that matches the performance of full-precision models while drastically reducing computational costs.
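The core of the b1.58 scheme is absmean weight quantization to {-1, 0, +1}; a sketch of that published function (training-time details such as the straight-through estimator and activation quantization are omitted):
```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """BitNet b1.58-style weight quantization: scale by the mean absolute
    value, then round-and-clip each weight to {-1, 0, +1} (~1.58 bits
    per weight)."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale               # effective weight is w_q * scale

w = torch.randn(4, 4)
w_q, s = absmean_ternary(w)
print(w_q)                          # entries are only -1, 0, or +1
print((w_q * s - w).abs().mean())   # mean quantization error
```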
The paper introduces Magnet, a framework for synthesizing multi-turn tool-use training data for LLMs by automatically translating function signature paths into query and executable function call sequences. It models function interactions as a graph and uses node operations to build signature paths, then employs context distillation with positive (reference function calls) and negative hints (incorrect function calls) to guide trajectory generation. Fine-tuning a 14B model with Magnet-synthesized data achieves state-of-the-art performance on BFCL-v3 and ToolQuery, outperforming Gemini-1.5-pro-002.
Introduces a novel graph-based framework, Magnet, for synthesizing high-quality, multi-turn tool-use training data by translating function signature paths into realistic query-function call sequences.
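The graph construction can be pictured with a toy signature graph and path sampler; the functions and edge rule below are ours, not Magnet's:
```python
import random

# Toy signature graph: nodes are function signatures, an edge means one
# function's output can feed another's input.
GRAPH = {
    "search_flights(dest) -> flights": ["book_flight(flights) -> booking"],
    "book_flight(flights) -> booking": ["email_receipt(booking) -> none"],
    "email_receipt(booking) -> none": [],
}

def sample_signature_path(graph, start, max_len=4):
    """Random walk over the graph; the resulting signature path is the
    skeleton that a Magnet-style pipeline translates into a user query
    plus an executable multi-turn function-call sequence."""
    path, node = [start], start
    while graph[node] and len(path) < max_len:
        node = random.choice(graph[node])
        path.append(node)
    return path

print(sample_signature_path(GRAPH, "search_flights(dest) -> flights"))
```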
The paper introduces Magma, a foundation model extending vision-language models with spatial-temporal intelligence for multimodal AI agents. Magma is trained on heterogeneous datasets of images, videos, and robotics data, using Set-of-Mark (SoM) for action grounding and Trace-of-Mark (ToM) for action planning. Experiments demonstrate that SoM and ToM enhance spatio-temporal intelligence, enabling Magma to achieve state-of-the-art results in UI navigation and robotic manipulation while maintaining strong multimodal understanding.
Introduces Set-of-Mark (SoM) and Trace-of-Mark (ToM) as novel methods for grounding actions and planning in visual-spatial environments, significantly improving agentic capabilities in multimodal AI models.
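Set-of-Mark prompting itself is simple to picture: number the candidate regions so the model can ground an action by mark index instead of raw pixels. A sketch using PIL, with rendering details that are ours rather than Magma's exact pipeline:
```python
from PIL import Image, ImageDraw

def draw_set_of_mark(image, boxes):
    """Overlay numbered marks on candidate regions so a model can ground
    an action as 'click mark 2' rather than raw pixel coordinates."""
    draw = ImageDraw.Draw(image)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")
    return image

marked = draw_set_of_mark(Image.new("RGB", (200, 120), "white"),
                          [(10, 10, 90, 50), (110, 60, 190, 110)])
marked.save("som_example.png")
```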

