Microsoft Research

34 papers from Microsoft Research on Multimodal Models

Jul 8, 2026

5d ago·also Microsoft Research, UW, Korea U +1

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

High-fidelity curation of medical multimodal data can drastically improve AI model performance, with MedPMC achieving remarkable clinical relevance and benchmark results.

Rui Shi, Gui Yang, Yuntian Liu +3

Data Curation & Synthetic Data Multimodal Models

Jul 7, 2026

DeepMind·6d ago·also ANL, Artifex Labs, Chunghwa Telecom Laboratories, IIT Madras +5

Pluralis v0.1: Towards a Multicultural, Multimodal, Multilingual Benchmark for AI Risk and Reliability

VLMs are prone to critical failures that vary significantly across cultures, exposing the inadequacy of Western-centric safety benchmarks.

Alicia Parrish, Rajat Shinde, Sanket Badhe +69

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Multimodal Models

6d ago·also Microsoft Research

CAIRN: Cross-Room 3D Scene Understanding with Topology-Aware Large Multimodal Models

CAIRN redefines 3D scene understanding by seamlessly integrating room-level topology with object-level relations, achieving unprecedented performance in multi-room environments.

He Liang, Chenyang Ma, Yiming Zhang +4

Multimodal Models Reasoning & Chain-of-Thought

Jul 6, 2026

1w ago·also Microsoft Research

UNIVERSE: Unified Video Action Models for Autonomous Driving with Flexible Mask-Modulated Modality Generation

UNIVERSE achieves a remarkable 4.3× speedup in trajectory inference while maintaining planning accuracy, revolutionizing how video dynamics inform autonomous driving actions.

Mengmeng Liu, Diankun Zhang, Jiuming Liu +6

Multimodal Models World Models & Planning

Jul 5, 2026

Microsoft Research·1w ago·also NUS, Tsinghua AI, Friedrich-Alexander-Universität, NTU +3

ResearchStudio-Reel: Automate the Last Mile of Research from Paper to Poster, Video, and Blog

ResearchStudio-Reel not only automates research dissemination but does so with unprecedented quality, outperforming both traditional methods and leading LLMs in aesthetic appeal and information accuracy.

Lingao Xiao, Yalun Dai, Yangyu Huang +16

Multimodal Models

Jul 2, 2026

Microsoft Research·1w ago

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

Interleaving speech and text during ASR training boosts entity recognition accuracy and narrows the gap between modalities, challenging traditional training paradigms.

Ruchao Fan, Yiming Wang, Rui Zhao +10

Multimodal Models Speech & Audio

Microsoft Research·1w ago·also Tsinghua AI, NJU

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Achieving 100% task success in closed-loop execution, Embodied.cpp revolutionizes how embodied AI models are deployed across diverse hardware platforms.

Ling Xu, Chuyu Han, Borui Li +7

Inference & Quantization Multimodal Models Robotics & Embodied AI

Jul 1, 2026

Microsoft Research·1w ago·also HKUST, ZJU

Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models

Ink3D achieves a breakthrough in 3D asset creation, enabling the generation of complex textures that were previously unattainable with conventional methods.

Yue Han, Chong Li, Zhening Liu +4

Computer Vision Multimodal Models

Jun 29, 2026

Microsoft Research·2w ago

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

PRIME-Speech achieves low-latency, accurate speech-to-speech generation without sacrificing the robust performance of existing speech-to-text models.

Heng Lu, Ruchao Fan, Yao Qian +5

Multimodal Models Speech & Audio

Jun 25, 2026

Microsoft Research·2w ago·also USTC

Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation

Qwen-Image-Agent bridges the context gap in T2I models, achieving state-of-the-art performance by constructing context from user input and external sources.

Zekai Zhang, Jiahao Li, Kaiyuan Gao +17

Computer Vision Multimodal Models Tool Use & Agents

Jun 24, 2026

Microsoft Research·2w ago·also Twente

Learning Robot Visual Navigation in Crowds via Intention-Aware Scene Representations

Robots can now navigate crowded spaces more effectively by understanding human intentions, thanks to a new method that integrates rich visual cues into their decision-making process.

Han Bao, Bingyi Xia, Hanjing Ye +4

Computer Vision Multimodal Models Robotics & Embodied AI

Jun 23, 2026

Tsinghua AI·2w ago·also Microsoft Research, DUT, HKUST, PKU

MambaRaw: Selective State Space Modeling for Efficient 4K Raw Image Reconstruction

MambaRaw achieves a remarkable 1.4 dB increase in PSNR at low metadata bitrates while slashing coding latency by nearly 9%, setting a new benchmark in raw image reconstruction.

Fanhu Zeng, Tongda Xu, Xingguo Xu +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Jun 22, 2026

Tsinghua AI·3w ago·also Microsoft Research, BAAI, Beihang, NTU +1

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

High-diversity training improves safety in VLA models, but sub-optimal trajectory synthesis still hinders task success.

Rongxu Cui, Zongzheng Zhang, Jingrui Pang +6

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Jun 17, 2026

Microsoft Research·3w ago

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Zero-shot transfer of a refined RL policy boosts manipulation success rates from 42% to 76% on real robots, showcasing a breakthrough in sim-to-real applications.

Kinam Kim, Heecheol Kim, Katsushi Ikeuchi +2

Multimodal Models Robotics & Embodied AI

Jun 16, 2026

Microsoft Research·3w ago·also Tsinghua AI, PKU, Princeton

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

MuseVLA achieves an impressive 80.6% success rate in robotic manipulation tasks by leveraging diverse sensing modalities, surpassing traditional RGB-only models.

Xingyuming Liu, Ruichun Ma, Heyu Guo +6

Multimodal Models Robotics & Embodied AI

Jun 15, 2026

Jun 15, 2026·also Microsoft Research

Closed-Loop Triplet Synergistic Generation for Long-Form Video

CoTriSyGen achieves unprecedented long-range coherence in video generation by integrating visual evidence into a dynamic memory system, drastically reducing identity drift across shots.

Xinlei Yin, Xiulian Peng, Xiao Li +1

Computer Vision Multimodal Models Tool Use & Agents +1

Jun 10, 2026

Microsoft Research·Jun 10, 2026·also Bonn, UIUC

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

Retrieving the right prompts can boost LMM performance by up to 30%, challenging the assumption that similarity guarantees effectiveness in in-context learning.

Garvita Allabadi, Matteo Sodano, Roberto Estevão +4

Multimodal Models

Microsoft Research·Jun 10, 2026·also Kuaishou, SNU, University of Science and Technology

A Comprehensive Ecosystem for Open-Domain Customized Video Generation

A million-scale dataset for identity-preserving video generation enables a new benchmark that outperforms existing models with minimal parameter overhead.

Jingxu Zhang, Yuqian Hong, Daneul Kim +4

Computer Vision Data Curation & Synthetic Data Multimodal Models

Jun 9, 2026

Microsoft Research·Jun 9, 2026·also CUHK, Oxford, SJTU

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

Code-based 3D reconstruction achieves superior edit fidelity and locality, outperforming traditional point-cloud methods in preserving unedited regions.

Yuhao Wang, Puyi Wang, Linjie Li +3

Code Generation & Program Synthesis Multimodal Models

Jun 8, 2026

Microsoft Research·Jun 8, 2026·also Adelaide University, ZJU

Latent Spatial Memory for Video World Models

Latent spatial memory can accelerate video generation by over 10 times while dramatically reducing memory usage, revolutionizing how we model dynamic scenes.

Weijie Wang, Haoyu Zhao, Zeyu Zhang +4

Multimodal Models World Models & Planning

Microsoft Research·Jun 8, 2026·also DAMO, CUHK, Shanghai AI Lab, Shanghai Innovation +1

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

CapRL++ redefines caption quality through utility, enabling models to produce high-fidelity descriptions without the constraints of traditional supervised fine-tuning.

Penghui Yang, Long Xing, Xiaoyi Dong +8

Computer Vision Multimodal Models RLHF & Preference Learning

Microsoft Research·Jun 8, 2026·also University of Louisville

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

Encoder-free speech modeling can rival traditional methods, challenging the necessity of dedicated speech encoders in LLM architectures.

Ruchao Fan, Yiming Wang, Bo Ren +3

Multimodal Models Speech & Audio

Jun 4, 2026

Microsoft Research·Jun 4, 2026

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

AsyncWebRL achieves a staggering 2.9× increase in training throughput while setting a new state-of-the-art performance for web agents on challenging tasks.

Haoyue Bai, Ruiqi Yang, Chen Ye +3

Multimodal Models Tool Use & Agents Training Efficiency & Optimization

Jun 1, 2026

Microsoft Research·Jun 1, 2026·also Twente

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

OpenWebRL-4B sets a new benchmark for open-source visual web agents, achieving impressive success rates with minimal initial data while outperforming larger-scale competitors.

Qianhui Wu, Yuxi Chen, Hao Bai +6

Data Curation & Synthetic Data Multimodal Models Tool Use & Agents

May 3, 2026

May 3, 2026·also Microsoft Research, Forschungszentrum Jülich GmbH

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

Optimizing for runtime in multimodal training can be energy-inefficient, as data movement and overlap on Grace Hopper chips dominate energy consumption, not raw compute.

Mahmoud Ahmed, Sameh Abdulah, Olatunji Ruwase +4

Distributed Systems & Hardware Multimodal Models Training Efficiency & Optimization

May 1, 2026

Microsoft Research·May 1, 2026·also Dept. of ECE&ASRI, SNU

Map2World: Segment Map Conditioned Text to 3D World Generation

Forget grid layouts: Map2World lets you generate consistent 3D worlds from arbitrary segment maps, offering unprecedented control and scalability.

Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang +2

Computer Vision Multimodal Models World Models & Planning

Apr 19, 2026

Microsoft Research·Apr 19, 2026·also Fudan, Independent

Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration

A groundbreaking framework reduces false positives in recommendation systems by over 74%, restoring user control and transparency in content curation.

Jiahao Liu, Hansu Gu, Ning Gu +1

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

Apr 14, 2026

CMU ML·Apr 14, 2026·also Microsoft Research

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Iterative visual refinement lets agents navigate dense coding IDEs with superhuman precision, outperforming single-shot methods and paving the way for more reliable software engineering agents.

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso

Computer Vision Multimodal Models Tool Use & Agents

Apr 9, 2026

Microsoft Research·Apr 9, 2026·also MIT CSAIL

From Gaze to Guidance: Interpreting and Adapting to Users' Cognitive Needs with Multimodal Gaze-Aware AI Assistants

Gaze-tracking unlocks a new level of personalized AI assistance, enabling LLMs to infer user cognitive states and boost recall performance.

Valdemar Danry, Javier Hernandez, Andrew D Wilson +3

Eval Frameworks & Benchmarks Multimodal Models Natural Language Processing +1

Apr 2, 2026

Microsoft Research·Apr 2, 2026

GeoAI Agency Primitives

GeoAI assistants remain unproductive because they lack a crucial agency layer for iterative human-AI collaboration, a gap this paper addresses with nine core primitives.

Akram Zaytar, Rohan Sawahn, Caleb Robinson +5

Computer Vision Multimodal Models Tool Use & Agents

Apr 2, 2026·also Microsoft Research

DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data

Synthetic motion data, when represented as optical flow, unlocks a new level of realism and control in video diffusion models, surpassing the limitations of real-world datasets.

Wonjoon Jin, J. Won, Janghyeok Han +4

Computer Vision Data Curation & Synthetic Data Multimodal Models

Mar 2, 2026

Microsoft Research·Mar 2, 2026·also UMD

From Pixels to Patches: Pooling Strategies for Earth Embeddings

Ditch mean pooling in your geospatial foundation models: richer pooling methods like GeM can boost accuracy by up to 5% and slash the geographic generalization gap by 40%.

Isaac Corley, Inbal Becker-Reshef, Juan M. Lavista Ferres

Computer Vision Data Curation & Synthetic Data Multimodal Models

Feb 24, 2026

Feb 24, 2026·also Microsoft Research

ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

Forget slow, reactive GUI agents – ActionEngine uses a state-machine memory to plan actions programmatically, slashing costs by 11.8x and doubling speed while boosting task success to 95%.

Hongbin Zhong, Luis França, Tanakorn Leesatapornwongsa +2

Multimodal Models Tool Use & Agents World Models & Planning

Feb 18, 2025

Microsoft Research·Feb 18, 2025·also NVIDIA, KAIST, UW-Madison

Magma: A Foundation Model for Multimodal AI Agents

Forget task-specific models: Magma, a single foundation model, now outperforms them in both UI navigation and robotic manipulation by bridging verbal and action abilities.

Jianwei Yang, Reuben Tan, Qianhui Wu +1099

Multimodal Models Robotics & Embodied AI Tool Use & Agents
