
NVIDIA Research
Research division of NVIDIA focusing on GPU-accelerated AI, computer graphics, robotics, and autonomous systems.
www.nvidia.com
Recent Papers
The authors extend the Puzzle post-training neural architecture search framework to optimize the gpt-oss-120B model, creating gpt-oss-puzzle-88B, by combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning. This optimized model achieves significant per-token throughput speedups (up to 2.82X on a single H100 GPU) while maintaining or slightly exceeding the parent model's accuracy across various benchmarks. The paper advocates for request-level efficiency metrics to account for varying token counts and demonstrates that gpt-oss-puzzle-88B improves request-level efficiency by up to 1.29X.
Introduces a pipeline combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning within the Puzzle framework to optimize large language models for inference.
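The request-level framing mentioned above can be made concrete with a little arithmetic. Below is a minimal sketch, with made-up numbers rather than results from the paper, of why a per-token speedup can overstate the end-to-end gain when the optimized model emits a different number of tokens per request.

```python
# Minimal sketch (illustrative numbers only): per-token throughput can diverge from
# request-level efficiency when two models emit different numbers of tokens per request.

def request_level_speedup(tps_base, tokens_base, tps_opt, tokens_opt):
    """Compare end-to-end request latency instead of raw tokens/second.

    tps_*    : decode throughput in tokens per second
    tokens_* : tokens generated for the same request by each model
    """
    latency_base = tokens_base / tps_base   # seconds to finish the request
    latency_opt = tokens_opt / tps_opt
    return latency_base / latency_opt       # >1 means the optimized model finishes sooner

# Hypothetical example: a 2.8x per-token speedup shrinks to roughly 1.3x at the request
# level if the optimized model generates about twice as many tokens for the same prompt.
print(request_level_speedup(tps_base=100, tokens_base=500, tps_opt=280, tokens_opt=1100))
```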
The paper introduces Fun-DDPS, a generative framework for carbon capture and storage (CCS) modeling that combines function-space diffusion models with differentiable neural operator surrogates for both forward and inverse problems. By decoupling the learning of a prior over geological parameters from the physics-consistent guidance provided by a Local Neural Operator (LNO) surrogate, Fun-DDPS effectively handles data sparsity and ensures physically realistic solutions. Experiments on synthetic CCS datasets demonstrate that Fun-DDPS significantly outperforms standard surrogates in forward modeling with sparse observations and achieves comparable accuracy to rejection sampling in inverse modeling, while also generating physically consistent realizations with improved sample efficiency.
Introduces a function-space decoupled diffusion framework (Fun-DDPS) that improves both the accuracy and physical realism of forward and inverse modeling in carbon capture and storage.
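For readers unfamiliar with surrogate-guided diffusion, the sketch below shows the generic pattern the summary alludes to: a learned prior score is combined with the gradient of a data misfit computed through a differentiable surrogate. It is a simplified Langevin-style update under assumed shapes and interfaces, not the Fun-DDPS algorithm itself.

```python
# Generic sketch of surrogate-guided diffusion sampling (not the exact Fun-DDPS update):
# a learned prior score over geological parameters is combined with the gradient of a
# data-misfit term computed through a differentiable surrogate of the forward physics.
import torch

def guided_step(x_t, t, score_model, surrogate, y_obs, step_size=1e-2, guidance_scale=1.0):
    """One illustrative reverse step.

    x_t         : current sample of the parameter field, shape (B, C, H, W)
    score_model : callable (x, t) -> estimated score of the learned prior
    surrogate   : differentiable forward operator mapping parameters to predicted observations
    y_obs       : sparse observations to honor
    """
    x_t = x_t.detach().requires_grad_(True)
    misfit = ((surrogate(x_t) - y_obs) ** 2).sum()         # physics-consistency term
    grad_misfit = torch.autograd.grad(misfit, x_t)[0]      # gradient through the surrogate
    with torch.no_grad():
        score = score_model(x_t, t)                        # prior score (decoupled from physics)
        x_next = x_t + step_size * (score - guidance_scale * grad_misfit)
        x_next = x_next + (2 * step_size) ** 0.5 * torch.randn_like(x_t)  # Langevin-style noise
    return x_next
```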
This paper introduces Flow-Guided Neural Operator (FGNO), a self-supervised learning framework for time-series data that leverages flow matching to dynamically adjust the corruption level during training. FGNO uses the Short-Time Fourier Transform (STFT) to handle varying time resolutions and extracts hierarchical features by applying different levels of noise across network layers and flow times. By training with noisy inputs but extracting representations from clean inputs, FGNO achieves state-of-the-art performance across multiple biomedical time-series tasks, demonstrating robustness to data scarcity and improved representation learning.
Introduces Flow-Guided Neural Operator (FGNO), a novel self-supervised learning framework that dynamically adjusts corruption levels during training using flow matching and extracts representations from clean inputs.
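The core trick described above, corrupting inputs with a flow-matching schedule during training while extracting representations from clean inputs, can be sketched as follows. The encoder, head, and feature dimensions are placeholders, not FGNO's architecture.

```python
# Minimal sketch of the "train on corrupted inputs, represent clean inputs" idea using a
# flow-matching-style corruption schedule; module names and sizes are placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 256))
head = nn.Linear(256, 128)  # predicts the flow-matching velocity target
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

def train_step(x_clean):
    """x_clean: batch of (spectro)temporal features, shape (B, 128)."""
    noise = torch.randn_like(x_clean)
    t = torch.rand(x_clean.shape[0], 1)                  # corruption level per sample
    x_t = (1 - t) * noise + t * x_clean                  # linear flow-matching interpolation
    target = x_clean - noise                             # velocity pointing from noise to data
    pred = head(encoder(x_t))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def extract_representation(x_clean):
    return encoder(x_clean)                              # downstream features come from clean inputs
```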
The paper introduces ArtisanGS, an interactive tool suite for selecting and segmenting 3D Gaussian Splats (3DGS) to enable controllable editing of in-the-wild captures. It presents a fast AI-driven method for propagating user-guided 2D selection masks to 3DGS selections, supplemented by manual selection and segmentation tools for user intervention. The toolset achieves binary segmentation of unstructured 3DGS scenes without additional optimization, and its utility is demonstrated through user-guided local editing with a custom video diffusion model.
Introduces an interactive tool suite, ArtisanGS, for versatile Gaussian Splat selection and segmentation, enabling user-guided editing via a novel AI-driven propagation method and manual tools.
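As a point of reference for what "propagating a 2D selection mask to 3DGS selections" involves, here is a simple geometric baseline that projects Gaussian centers into the masked view; the paper's AI-driven propagation is more sophisticated, and the camera convention below is an assumption.

```python
# Simple geometric baseline (not the paper's learned propagation) for lifting a 2D selection
# mask to a per-Gaussian selection: project each splat center into the view and test the mask.
import numpy as np

def select_gaussians(centers, mask, K, R, t, threshold=0.5):
    """centers: (N, 3) world-space Gaussian means; mask: (H, W) binary 2D selection;
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation."""
    cam = centers @ R.T + t                  # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6
    pix = cam @ K.T
    pix = pix[:, :2] / pix[:, 2:3]           # perspective divide -> pixel coordinates
    h, w = mask.shape
    u = np.clip(np.round(pix[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(pix[:, 1]).astype(int), 0, h - 1)
    inside = (pix[:, 0] >= 0) & (pix[:, 0] < w) & (pix[:, 1] >= 0) & (pix[:, 1] < h)
    selected = in_front & inside & (mask[v, u] > threshold)
    return selected                          # boolean (N,), to be refined across views / by the user
```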
The paper introduces DuoGen, a general-purpose interleaved multimodal generation framework designed to improve the quality of models generating interleaved image and text sequences under general instructions. DuoGen constructs a large-scale instruction-tuning dataset from curated websites and synthetic examples and employs a two-stage decoupled training strategy using a pretrained multimodal LLM and a diffusion transformer (DiT). Experiments demonstrate that DuoGen outperforms existing open-source models in text quality, image fidelity, and image-context alignment, achieving state-of-the-art performance in text-to-image generation and image editing.
Introduces a two-stage decoupled training strategy for interleaved multimodal generation that combines a pretrained multimodal LLM for instruction understanding with a diffusion transformer (DiT) for image generation.
This paper evaluates the robustness of ten publicly available LLM safety guardrail models from major tech companies against 1,445 adversarial prompts across 21 attack categories. The study reveals a significant performance drop in all models when tested on novel, unseen prompts compared to public benchmarks, indicating potential training data contamination. A novel "helpful mode" jailbreak was also discovered in two models, where they generated harmful content instead of blocking it.
Demonstrates that current LLM safety guardrail models exhibit poor generalization to novel adversarial attacks, highlighting the limitations of relying solely on benchmark performance for evaluation.
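The evaluation protocol is straightforward to reproduce in outline: score each guardrail over prompts grouped by attack category and compare block rates on public-benchmark versus novel splits. The sketch below uses placeholder names, not the paper's harness.

```python
# Minimal sketch of the evaluation described above: run a guardrail classifier over
# adversarial prompts grouped by attack category and report the block rate per group.
# `guardrail` and the prompt records are placeholders, not the paper's artifacts.
from collections import defaultdict

def block_rates(guardrail, prompts):
    """prompts: iterable of dicts like {"text": str, "category": str}.
    guardrail: callable returning True when the prompt is flagged/blocked."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for p in prompts:
        total[p["category"]] += 1
        flagged[p["category"]] += int(guardrail(p["text"]))
    return {cat: flagged[cat] / total[cat] for cat in total}

# Running the same model on a public benchmark split and on a held-out novel split makes
# the generalization gap reported in the paper directly visible.
```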
The authors introduce Cosmos-Predict2.5, a flow-based video foundation model for physical AI that unifies Text2World, Image2World, and Video2World generation, leveraging a vision-language model for improved text grounding. Trained on 200M video clips and refined with reinforcement learning, Cosmos-Predict2.5 demonstrates significant improvements in video quality and instruction alignment compared to its predecessor, with models released at 2B and 14B scales. They also present Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation, achieving higher fidelity and robust long-horizon video generation despite being smaller than Cosmos-Transfer1.
Introduces a unified video foundation model, Cosmos-Predict2.5, and a Sim2Real/Real2Real translation framework, Cosmos-Transfer2.5, for scaling embodied intelligence through improved video generation and instruction alignment.
This work integrates small-molecule high-throughput screening with a deep-learning-based virtual screening approach to uncover new antibacterial compounds, illustrating a 90-fold improved hit rate over the high-throughput screening experiment used for training.
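The headline 90-fold figure is a hit-rate enrichment, i.e. the ratio of the virtual-screening hit rate to the hit rate of the underlying high-throughput screen; the numbers in the sketch below are hypothetical and chosen only to reproduce a 90-fold factor.

```python
# Illustrative arithmetic only: enrichment is the ratio of the virtual-screening hit rate
# to the hit rate of the original high-throughput screen; these counts are hypothetical.
def enrichment(hits_vs, tested_vs, hits_hts, tested_hts):
    return (hits_vs / tested_vs) / (hits_hts / tested_hts)

# e.g. 18 actives among 1,000 model-prioritized compounds vs. 200 actives among 1,000,000 screened
print(enrichment(18, 1_000, 200, 1_000_000))   # -> ~90, i.e. a 90-fold enrichment
```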
This paper introduces a generative predictive control (GPC) framework that leverages conditional flow-matching models to amortize sampling-based model predictive control (SPC) for contact-rich manipulation. By training these flow-matching models on SPC control sequences generated in simulation, the method learns proposal distributions that enable more efficient and informed sampling during online planning compared to methods relying on iterative refinement or gradient-based solvers. The approach is validated through extensive experiments in simulation and on a quadruped robot performing real-world loco-manipulation, demonstrating improved sample efficiency, reduced planning horizon requirements, and robust generalization.
Demonstrates that conditional flow-matching models can be effectively trained on noisy SPC data to generate meaningful proposal distributions, enabling efficient and robust online planning for contact-rich manipulation.
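The amortization step described above boils down to conditional flow matching on (state, SPC action sequence) pairs. The sketch below uses placeholder dimensions and a toy MLP, not the paper's model, to show the training objective and how a proposal is sampled by integrating the learned flow.

```python
# Minimal sketch of conditional flow matching on controller data (placeholder shapes/names):
# the model learns a velocity field mapping noise to SPC control sequences conditioned on
# the robot state, and is later sampled as a proposal distribution for online planning.
import torch
import torch.nn as nn

STATE_DIM, HORIZON, ACT_DIM = 32, 16, 12
net = nn.Sequential(
    nn.Linear(STATE_DIM + HORIZON * ACT_DIM + 1, 512), nn.SiLU(),
    nn.Linear(512, HORIZON * ACT_DIM),
)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)

def cfm_step(state, spc_actions):
    """state: (B, STATE_DIM); spc_actions: (B, HORIZON * ACT_DIM) flattened SPC solutions."""
    noise = torch.randn_like(spc_actions)
    t = torch.rand(state.shape[0], 1)
    x_t = (1 - t) * noise + t * spc_actions              # straight-line interpolant
    target = spc_actions - noise                         # constant target velocity
    pred = net(torch.cat([state, x_t, t], dim=-1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def propose(state, steps=8):
    """Integrate the learned flow from noise to get a proposal action sequence for SPC."""
    x = torch.randn(state.shape[0], HORIZON * ACT_DIM)
    for i in range(steps):
        t = torch.full((state.shape[0], 1), i / steps)
        x = x + net(torch.cat([state, x, t], dim=-1)) / steps
    return x.view(-1, HORIZON, ACT_DIM)
```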
The authors introduce AtomWorks, a data framework designed to streamline the development of biomolecular foundation models for tasks like structure prediction and protein design. Using AtomWorks, they trained RoseTTAFold 3 (RF3), a structure prediction network that improves chirality handling, bringing its performance closer to AlphaFold3. The release of AtomWorks, training data, and RF3 model weights under a BSD license aims to accelerate open-source biomolecular machine learning research.
Introduces AtomWorks, a comprehensive data framework, and leverages it to train RF3, a structure prediction network with enhanced chirality treatment, bridging the performance gap with closed-source models.
Audio Flamingo 2 (AF2) is introduced as an Audio-Language Model (ALM) that enhances audio understanding and reasoning by utilizing a custom CLAP model, synthetic Audio QA data, and a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance on over 20 benchmarks with a 3B-parameter model, outperforming larger models. The work also introduces LongAudio, a new dataset for training ALMs on long audio segments (30 seconds to 5 minutes), and demonstrates exceptional performance on the LongAudioBench benchmark after fine-tuning AF2.
Introduces Audio Flamingo 2, an ALM with enhanced audio understanding and reasoning capabilities, and the LongAudio dataset and benchmark for long audio understanding.

