Achieve seamless vector map generation across all land-cover classes from aerial imagery by enforcing shared-edge consistency, an approach that outperforms class-specific methods.
In streaming video, answering at the wrong time can be as bad as answering incorrectly, so this work introduces a framework that learns when to answer based on the availability of supporting visual evidence.
Forget direct prompt editing: this agentic planning framework, powered by offline RL and synthetic data, masters complex image styling by breaking it down into interpretable tool sequences.
Ditch mean pooling in your geospatial foundation models: richer pooling methods like GeM can boost accuracy by up to 5% and slash the geographic generalization gap by 40%.
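For context (not a detail from the paper itself): GeM, generalized-mean pooling from the image-retrieval literature, raises features to a learnable power p before averaging, interpolating between mean pooling (p = 1) and max pooling (p → ∞). A minimal PyTorch sketch of the generic layer, not the authors' exact head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean (GeM) pooling over a spatial feature map.

    p = 1 recovers average pooling; p -> inf approaches max pooling,
    so a single learnable p lets the model interpolate between them.
    """
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # learnable exponent
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W) feature map from a backbone
        x = x.clamp(min=self.eps).pow(self.p)          # elementwise x^p; clamp keeps the power stable
        x = F.avg_pool2d(x, kernel_size=x.shape[-2:])  # spatial mean over the whole map
        return x.pow(1.0 / self.p).flatten(1)          # (batch, channels) descriptor

# Usage: swap this in for the mean-pooling head of an encoder.
feats = torch.randn(2, 768, 14, 14)  # e.g., patch features reshaped to a grid
print(GeM()(feats).shape)            # torch.Size([2, 768])
```

Because the only new parameter is the exponent, it is a drop-in replacement for a mean-pooling head.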
LLMs can now answer questions on complex documents more accurately, thanks to a new system that understands layout and the hierarchical relationships between document components.
Forget monolithic models: pMoE shows that ensembling diverse expert prompts within a single model framework yields surprisingly large gains in visual adaptation across a wide range of tasks.
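The blurb doesn't spell out pMoE's routing, so the following is only a generic illustration of mixing several learnable prompt experts inside one frozen model; every name and shape here is an assumption, not pMoE's actual design:

```python
import torch
import torch.nn as nn

class PromptMoE(nn.Module):
    """Generic mixture-of-prompt-experts: several learnable prompts,
    softly combined per input by a lightweight gate, then prepended
    to the frozen backbone's token sequence."""
    def __init__(self, num_experts: int = 4, prompt_len: int = 8, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_experts, prompt_len, dim) * 0.02)
        self.gate = nn.Linear(dim, num_experts)  # scores experts from the first ([CLS]-like) token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim)
        weights = self.gate(tokens[:, 0]).softmax(dim=-1)           # (batch, experts)
        mixed = torch.einsum("be,eld->bld", weights, self.prompts)  # (batch, prompt_len, dim)
        return torch.cat([mixed, tokens], dim=1)                    # prepend the mixed prompt
```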
Achieve up to 57% better point cloud compression by combining the generalization of pretrained models with the robustness of implicit neural representations.
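As background on the second ingredient (a generic sketch, not this paper's hybrid design): an implicit neural representation deliberately overfits a small coordinate network to each point cloud, so the trained weights themselves become the compressed payload. A toy example with hypothetical shapes and sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny coordinate MLP fit to one cloud: xyz -> logit of "point lies on the shape".
# The trained weights, not the points, are what gets stored or transmitted.
inr = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# Toy cloud: points on a sphere of radius 0.5, plus random off-surface negatives.
on_surface = F.normalize(torch.randn(1024, 3), dim=1) * 0.5
off_surface = torch.rand(1024, 3) * 2 - 1          # uniform in [-1, 1]^3
coords = torch.cat([on_surface, off_surface])
labels = torch.cat([torch.ones(1024, 1), torch.zeros(1024, 1)])

opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for _ in range(500):                               # overfitting is the point: one net per cloud
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(inr(coords), labels)
    loss.backward()
    opt.step()
```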
VisRAG models can now handle real-world image degradations like blur and shadows without sacrificing accuracy, thanks to a new causality-guided architecture that disentangles semantics from visual distortions.
ImageNet-pretrained CNNs can spot looted archaeological sites from space with surprising accuracy, leaving traditional methods in the dust.
Ditch the tracker: HiMAP offers a robust, tracking-free trajectory prediction fallback that actually rivals tracking-based methods in autonomous driving scenarios.
Diffusion models can now efficiently tackle rare event sampling in molecular dynamics, unlocking rapid calculation of folding free energies in minutes to hours on a GPU.
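The jump from "sampling rare events" to "free energies in minutes" rests on a textbook identity rather than anything specific to this paper: once both metastable states can be sampled, the folding free energy is just the log-ratio of their populations,

\Delta F_{\text{fold}} = -k_B T \,\ln\!\left(\frac{p_{\text{folded}}}{p_{\text{unfolded}}}\right),

so a fast sampler translates directly into a fast free-energy estimate.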
Humanoid robots can now learn complex, terrain-aware motions directly from video using a low-cost pipeline, eliminating the need for expensive MoCap data and manual motion design.
By predicting tracking models rather than image features, GOT-JEPA unlocks more robust object tracking, even when objects are heavily occluded or the environment is dynamic.
LLMs and VLLMs can team up to generate synthetic image data so good it beats state-of-the-art methods and boosts performance on rare classes and open-vocabulary object detection.