Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
Robots can now manipulate objects with greater dexterity and adaptability thanks to a new world model that leverages both vision and high-frequency tactile feedback to predict and react to contact dynamics.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
Current image generation unlearning methods are surprisingly brittle: adversarial image prompts, optimized with attention-guided masking, can effectively resurrect supposedly "forgotten" concepts.
Current multimodal models are surprisingly bad at understanding long, complex videos, struggling to integrate audio, visual, and text cues even for basic reasoning tasks.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends to only the most informative patches.
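The blurb does not say how AutoGaze scores patches, but the general recipe (rank patches, keep the top few before the LLM sees them) can be sketched. Below is a minimal, hypothetical illustration; the feature-norm saliency score and the function name are assumptions, not the paper's method.

```python
# Minimal sketch of top-k patch selection before feeding an MLLM.
# The scoring rule (feature-norm saliency) is an illustrative assumption;
# AutoGaze's actual selection criterion is not specified in the blurb.
import torch

def select_informative_patches(patch_feats: torch.Tensor, keep_ratio: float = 0.1):
    """patch_feats: (num_patches, dim) visual features for one frame."""
    scores = patch_feats.norm(dim=-1)                   # saliency proxy
    k = max(1, int(keep_ratio * patch_feats.shape[0]))
    top = torch.topk(scores, k).indices.sort().values   # keep original patch order
    return patch_feats[top], top

feats = torch.randn(4096, 1024)                # e.g. patches from one 4K frame
kept, idx = select_informative_patches(feats, keep_ratio=0.01)
print(kept.shape)                              # roughly 100x fewer tokens for the LLM
```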
Achieve real-time, synchronized audio-visual generation at 25 FPS by distilling a bidirectional diffusion model into a fast, autoregressive architecture, overcoming training instability with novel alignment and token handling techniques.
Current multimodal LLMs choke on long-form video understanding, either forgetting details or getting lost in the timeline, but a new agentic architecture with dynamic memory management offers a promising fix.
Text-to-video generation gets a 1.58x speed boost with CalibAtt, a training-free method that exploits consistent sparsity patterns in attention layers.
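The idea of "consistent sparsity patterns" can be illustrated with a small sketch: measure once, on calibration inputs, which attention entries carry weight, then reuse that mask at inference. The quantile thresholding below is an assumption for illustration, not CalibAtt's published procedure.

```python
# Hedged sketch: calibrate a sparsity mask once, reuse it to skip
# low-weight attention entries at inference (training-free).
import torch

def calibrate_mask(q, k, keep: float = 0.2):
    """Measure which attention entries carry weight on calibration inputs."""
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    mask = attn >= torch.quantile(attn, 1.0 - keep)
    mask |= torch.eye(attn.shape[-1], dtype=torch.bool)   # never mask a full row
    return mask

def sparse_attention(q, k, v, mask):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(64, 128)     # toy single-head example
mask = calibrate_mask(q, k)
out = sparse_attention(q, k, v, mask)
```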
By combining video generation and vision-language models, EmboAlign achieves a 43% boost in real-world robot manipulation success without any task-specific training.
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Forget everything you thought you knew about continual learning: pretrained Vision-Language-Action models can learn new robotic skills without catastrophic forgetting, even with minimal replay.
Multimodal models often exhibit lower confidence than their unimodal counterparts when they're about to fail, and this work leverages that insight to build a better failure detector.
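The underlying signal is easy to picture: when the model's confidence on its own prediction drops, flag a likely failure. The sketch below is the classic max-softmax-confidence baseline with a tuned threshold, not necessarily the paper's exact detector.

```python
# Minimal sketch: flag likely failures when the model's confidence
# (max softmax probability) falls below a threshold tuned on validation data.
import torch

def predict_failure(logits: torch.Tensor, threshold: float = 0.6) -> torch.Tensor:
    """logits: (batch, num_classes). Returns True where a failure is predicted."""
    confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
    return confidence < threshold

logits = torch.randn(8, 10)
print(predict_failure(logits))
```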
By explicitly guiding attention with predicted action sequences, AGA overcomes the limitations of standard dot-product attention in video action anticipation, leading to better generalization and interpretability.
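One simple way to "guide attention with predicted actions" is to add a bias to the attention logits based on how well each key matches an embedding of the predicted action sequence. The additive-bias form below is only an assumed illustration of the idea, not AGA's published mechanism.

```python
# Hedged sketch: bias dot-product attention toward video tokens that agree
# with an embedding of the predicted action sequence (additive-bias assumption).
import torch

def action_guided_attention(q, k, v, action_emb, guide_weight: float = 1.0):
    """q, k, v: (seq, dim); action_emb: (dim,) embedding of predicted actions."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    guidance = (k @ action_emb) / d ** 0.5          # how well each key matches the plan
    scores = scores + guide_weight * guidance       # broadcast over query positions
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 64)
out = action_guided_attention(q, k, v, action_emb=torch.randn(64))
```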
By explicitly disentangling shared and view-specific features across multi-view fundus images, MVGFDR achieves superior diabetic retinopathy grading compared to methods that directly fuse visual features.
Unlock robot learning with hidden knowledge: TOPReward extracts surprisingly accurate task progress signals directly from VLM token probabilities, bypassing the need for explicit reward engineering.
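Reading a progress signal out of token probabilities can be sketched with a yes/no query: ask the VLM whether the task is done and treat the probability mass on "Yes" as a dense reward. The prompt, token ids, and function name below are placeholders for illustration, not TOPReward's exact formulation.

```python
# Hedged sketch: turn VLM token probabilities into a task-progress reward.
import torch

def progress_reward(next_token_logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    """next_token_logits: (vocab,) logits for the first answer token."""
    pair = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return pair[0].item()            # probability mass on "Yes" vs "No"

# Usage: run the VLM on (frame, "Has the robot finished the task? Yes or No:")
# and pass the next-token logits plus the tokenizer's ids for "Yes"/"No".
vocab = 32000
logits = torch.randn(vocab)
print(progress_reward(logits, yes_id=3869, no_id=1939))   # placeholder ids
```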
Time series generation can be dramatically improved by explicitly conditioning on semantic understanding, as demonstrated by a novel vision-centric framework.
Forget painstakingly engineering robot behaviors: DreamZero learns directly from video of other robots or even humans, adapting to new tasks and bodies with just minutes of data.
Forget robotics pre-training: ActionCodec, a new action tokenizer designed with information-theoretic principles, achieves state-of-the-art VLA performance on LIBERO.
Forget monolithic LoRAs: LoRWeB dynamically mixes a basis set of LoRAs to unlock SOTA generalization in visual analogy tasks.
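Dynamically mixing a basis of LoRAs can be sketched as a small router producing input-dependent weights over several low-rank deltas. The router design and gating below are assumptions for illustration; LoRWeB's exact scheme may differ.

```python
# Hedged sketch: input-dependent mixture over a basis of LoRA deltas.
import torch
import torch.nn as nn

class LoRAMixture(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8, num_loras: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_loras, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_loras, rank, d_out))
        self.router = nn.Linear(d_in, num_loras)    # produces mixing weights

    def forward(self, x: torch.Tensor, base_out: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.router(x), dim=-1)              # (batch, num_loras)
        delta = torch.einsum("bd,ndr,nro->bno", x, self.A, self.B)
        return base_out + torch.einsum("bn,bno->bo", w, delta)

layer = LoRAMixture(d_in=512, d_out=512)
x = torch.randn(2, 512)
out = layer(x, base_out=torch.zeros(2, 512))
```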
Forget static datasets – RL-based co-training unlocks +20% real-world VLA performance by interactively leveraging simulation while preserving real-world capabilities.
By decoupling MLLM instruction tuning from DiT alignment, DuoGen achieves state-of-the-art interleaved multimodal generation without costly unimodal pretraining.
VLMs, typically praised for their multimodal synergy, can be easily weaponized to manipulate search rankings via imperceptible image perturbations and fluent textual suffixes, outperforming unimodal attacks.
Synthesizing realistic radar data from camera images is now possible, bridging the gap between visual and radar perception for autonomous driving.
Forget synthetic data that looks like it came from a PS2 game: NVIDIA's new Cosmos-Predict2.5 generates high-fidelity videos for training embodied AI, opening the door to more realistic and reliable simulations.
Multimodal ophthalmic AI is poised for a leap, but current models still struggle with data variability, limited annotations, and generalization across diverse patient populations.
A 3B parameter model, Audio Flamingo 2, now rivals larger proprietary models in audio understanding and reasoning, even handling audio segments up to 5 minutes long.