27 papers from Microsoft Research on Natural Language Processing
Generative recommendation models can adapt to evolving user behavior without catastrophic forgetting by selectively updating item tokens based on a novel drift-detection mechanism.
Medical AI Scientist leapfrogs generic LLMs in clinical research, generating higher-quality, evidence-backed hypotheses and manuscripts that rival top-tier medical publications.
Hypergraph modeling of patient visits, coupled with contrastive pre-training, significantly boosts medication recommendation accuracy and safety by capturing complex relationships missed by traditional graph-based approaches.
LLMs, even when prompted or fine-tuned, struggle to replicate the messy reality of human conversation, raising serious questions about their utility as proxies for social interaction.
LLMs' ability to fairly represent English dialects hinges on the quality of human consensus, revealing a fundamental challenge in improving performance for low-resource locales.
Ditch the task-specific verifier: energy-based fine-tuning (EBFT) lets you directly optimize sequence-level behavior in LMs, beating SFT and matching RLVR in downstream tasks.
LLMs exhibit a surprising "conversation tax" in diagnostic reasoning, frequently abandoning correct initial diagnoses to align with incorrect user suggestions in multi-turn dialogues.
Forget brute-force scaling: Tiny Aya proves a 3B parameter model can achieve state-of-the-art multilingual performance with clever training and region-aware specialization.
LLM-generated text can be a surprisingly effective and cost-efficient expansion source for pseudo-relevance feedback, rivaling corpus-derived signals in low-resource information retrieval tasks.
A 4B parameter model can now beat much larger models at social reasoning, thanks to a new RL framework that aligns model reasoning trajectories with human cognition.
LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise in a new benchmark derived from a world-leading company's CTI workflow.
Can RAG systems handle complex, multi-sentence queries while maintaining factual grounding and transparency?
LLMs writing long stories frequently contradict themselves on basic facts and timelines, especially in the middle of the narrative, highlighting a critical weakness in long-form generation.
LLMs can mimic your style, but your friends can still tell it's not really you, especially when it comes to your opinions.
LLMs can now more accurately answer questions on complex documents thanks to a new system that understands layout and hierarchical relationships between document components.
Achieve state-of-the-art TTS and SLM performance while slashing inference costs and eliminating content hallucinations by synchronizing text and acoustic tokens.
LLMs struggle with instruction following in Indic languages despite progress in high-resource languages, as shown by a new benchmark spanning 14 languages.
Imagine a world where web agents don't just click and type, but orchestrate complex tasks with the reliability of APIs – Web Verbs offer a path to that future.
LLMs can reason more causally by simply checking if their counterfactual predictions are consistent, even without any extra training data.
Guaranteeing consistent communication between AI agents is now possible: a new certification protocol slashes disagreement by up to 96% by ensuring agents share a common understanding of terms.
LLM development teams often resort to workarounds and augmentation strategies when integrating domain experts proves impractical, revealing a gap between ideal participatory design and real-world constraints.
By explicitly prompting for reflection on failure, ERL unlocks up to 81% better performance in complex RL tasks and 11% gains in tool-using reasoning.
Language models can now internalize experiential knowledge and system prompts more effectively through on-policy context distillation, leading to better task accuracy and out-of-distribution generalization.
Ditch the army of task-specific models: AdNanny shows a single, reasoning-centric LLM can handle diverse offline advertising tasks with improved accuracy and reduced manual effort.
LLMs can get a 12% performance boost in low-resource languages by using a new framework that tailors data refinement, synthetic text generation, and continual pretraining to each language's digital footprint.
LLMs can now automate structured reporting from nurse dictations and medical order extraction from doctor-patient consultations, thanks to two new open-source datasets and an agentic pipeline for generating realistic training data.
An LLM-powered smart tutor isn't just another homework helper; it's a real-time feedback loop for instructors, revealing student struggles and enabling more effective teaching.