NVIDIA Research

33 papers from NVIDIA Research on Computer Vision

Jul 6, 2026

1w ago · also NVIDIA

Clustered Codebook Quantization for 2D Gaussian-based Image Compression

CGVQ reduces bits per pixel by 20% while maintaining visual quality in Gaussian-based image compression.

Runze Cheng, Yicheng Zhan, Josef Spjut +1

Computer Vision Inference & Quantization

Jul 5, 2026

Tsinghua AI · 1w ago · also NVIDIA, TU Darmstadt, TU Munich

How to Build Digital Humans? From Priors to Photorealistic Avatars

Current avatar systems are more diverse than ever, yet foundational prior learning is often overlooked in discussions of photorealistic digital humans.

Wojciech Zielonka, Tobias Kirschstein, Timo Bolkart +8

Computer Vision Multimodal Models

Jul 2, 2026

NVIDIA · 1w ago · also Keio

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Current VLMs struggle with specialized domains, failing to adapt effectively in both zero-shot and ICL scenarios, revealing critical gaps in their spatio-temporal reasoning abilities.

Rintaro Otsubo, Ryo Fujii, Reina Ishikawa +7

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Jun 29, 2026

NVIDIA · 2w ago

Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

Dynamic token editing in image synthesis could redefine how we approach high-resolution generative models.

Shufan Li, Greg Heinrich, Hanrong Ye +3

Computer Vision Multimodal Models

Jun 25, 2026

NVIDIA · 2w ago

Extracting Neural Materials from Multi-view Images

NeuMatEx outperforms PBR techniques by extracting complex neural materials with unprecedented visual fidelity and precision from multi-view images.

Kim Youwang, Jon Hasselgren, Peter Kocsis +3

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Jun 22, 2026

KlingAI Research · 3w ago · also NVIDIA, Tsinghua AI, BIT, USTC +1

ScalingAttention: Discovering Intrinsic Sparse Attention Topology for Video Diffusion Transformers

Achieving up to 1.90X speedup in video generation without sacrificing fidelity, ScalingAttention redefines efficiency in Diffusion Transformers.

Ruiliang Zhou, Xuecheng Wu, Kang He +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Jun 18, 2026

3w ago · also NVIDIA, UofT

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

Self-correcting models can achieve unprecedented fidelity and plausibility in generative tasks by actively learning from their own alignment errors.

Daniel Gilo, Sven Elflein, Ido Sobol +1

Computer Vision Multimodal Models

Jun 16, 2026

3w ago · also NVIDIA, UIUC

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

A single-line code change can restore diversity and fidelity in video generation models, outperforming even the original teacher models.

Siyi Chen, Shaowei Liu, Yixuan Jia +4

Computer Vision

3w ago · also NVIDIA

A Hybrid Optimization Framework for Grasp Synthesis under Partial Observations

Combining learning and geometric optimization, this framework achieves a 60.9% grasp success rate, outperforming traditional methods by a significant margin.

Wenzheng Zhang, Fahira Afzal Maken, Tin Lai +1

Computer Vision Robotics & Embodied AI

Jun 11, 2026

NVIDIA · Jun 11, 2026

Fully Distributed Multi-View 3D Tracking in Real-Time

Real-time multi-view 3D tracking can now be achieved at scale without the computational burdens of centralized systems, thanks to a fully distributed approach.

Byron Hernandez, Fangyu Li, Aotian Wu +3

Computer Vision Distributed Systems & Hardware

May 31, 2026

NVIDIA · May 31, 2026 · also Princeton

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

Achieving zero-shot generalization in robotic grasping across diverse gripper designs could revolutionize how robots interact with their environments.

Beining Han, Clemens Eppner, Balakumar Sundaralingam +3

Computer Vision Robotics & Embodied AI

May 28, 2026

NVIDIA · May 28, 2026 · also Beihang, HKU, UCSD, University of California

Grounded 3D-Aware Spatial Vision-Language Modeling

Grounding boosts spatial reasoning in VLMs: explicitly linking language to 2D and 3D scene elements lets models decompose complex spatial problems and improve performance even on non-grounded tasks.

An-Chieh Cheng, Yang Fu +21

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

NVIDIA · May 28, 2026 · also Apple ML, ETH, UofT

Déjà View: Looping Transformers for Multi-View 3D Reconstruction

Forget scaling laws: a single looped transformer block, iterated explicitly, crushes billion-parameter feed-forward networks at multi-view 3D reconstruction.

Alessandro Burzio, Tobias Fischer, Sven Elflein +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

May 26, 2026

NVIDIA · May 26, 2026 · also Tsinghua AI, Edinburgh, HKUST, NVAITC +1

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

Masking just 5% of attention heads in vision-language models tanks performance on long-context tasks, revealing a surprisingly sparse and critical set of "multimodal retrieval heads" that attend to both text and images.

Aaron Branson Cigres Li, Yu Zhao, Yiming Du +6

Computer Vision Interpretability & Mechanistic Interp Multimodal Models

NVIDIA · May 26, 2026 · also PI, UPenn

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

Get up to 1.79x faster ViT inference on high-resolution images without sacrificing accuracy by surgically replacing full-attention blocks with cheaper alternatives *after* pre-training.

Dongyun Zou, Zhuoyang Zhang, Wenkun He +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

NVIDIA · May 26, 2026

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Ditch slow, token-by-token box generation: LocateAnything's Parallel Box Decoding (PBD) boosts VLM grounding speed and accuracy by decoding entire bounding boxes at once.

Shihao Wang, Shilong Liu, Yu Kuang +11

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

May 25, 2026

May 25, 2026 · also NVIDIA, Harvard, Physion Labs

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

Ditch the clunky tool-use pipelines: STORM teaches video-language models to reason about space and time using *internalized* latent trajectories, slashing inference costs while boosting accuracy.

Yiming Liang, Yixiao Chen, Yiyang Zhou +5

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

NVIDIA · May 25, 2026 · also Beihang

F-RNG: Feed-Forward Relightable Neural Gaussians

Relight 3D assets 25x faster with a feed-forward network that distills relightable representations from large reconstruction models, sidestepping expensive per-scene optimization.

Guangming Fu, Jiahui Fan, Jian Yang +2

Computer Vision

Apr 28, 2026

Apr 28, 2026 · also NVIDIA

8DNA: 8D Neural Asset Light Transport by Distribution Learning

Near-field lighting? No problem: 8DNA pre-bakes complex light transport into neural representations, outperforming prior methods with faster inference and lower training costs.

Liwen Wu, Haolin Lu, Bing Xu +4

Computer Vision

Apr 27, 2026

NVIDIA · Apr 27, 2026

MotionBricks: Scalable Real-Time Motions with Modular Latent Generative Model and Smart Primitives

Forget clunky animation pipelines – MotionBricks lets you assemble real-time, high-quality character motions like LEGOs, even controlling robots.

Olivier Dionne, Mick Ruyter, David Minor +11

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

Apr 22, 2026

NVIDIA · Apr 22, 2026

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

Open-vocabulary 3D instance segmentation just got 100x faster, thanks to a new transformer architecture that ditches region proposals and fragmented masks.

C. Choy, Junha Lee, Chunghyun Park +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

Apr 9, 2026

Apr 9, 2026 · also CMU ML, NVIDIA, Telecom

DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction

Finally, a method disentangles dynamic egocentric scenes into background, hand, and object components, enabling fine-grained understanding and editing.

Tingxi Chen, Ting-Hsuan Chen, Zhengxue Cheng +4

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 8, 2026

NVIDIA · Apr 8, 2026 · also Emory, Harvard, JHU

Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling

Unlock the power of cutting-edge photon-counting CT imaging on your existing routine chest CT scans, boosting lesion detection by 10-15%.

Xinze Zhou, Wenxuan Li, Scott Ye +8

Computer Vision Data Curation & Synthetic Data Inference & Quantization

NVIDIA · Apr 8, 2026 · also UIUC, UofT

MoRight: Motion Control Done Right

Finally, a video generation model lets you puppeteer objects and their reactions independently, all while freely moving the camera.

Shaowei Liu, Xuanchi Ren, Tianchang Shen +4

Computer Vision Multimodal Models Robotics & Embodied AI +1

Apr 6, 2026

Apr 6, 2026 · also NVIDIA, UW, Cisco Research

GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

Serving both image and video diffusion models on the same hardware? GENSERVE's step-level preemption and dynamic resource allocation can boost your service level agreement (SLA) attainment by up to 44%.

Zhangke Li, Triston Cao, Myungjin Lee

Computer Vision Distributed Systems & Hardware Inference & Quantization

Apr 1, 2026

NVIDIA · Apr 1, 2026

Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

Gaussian Splatting gets a high-frequency boost: Neural Harmonic Textures unlock significantly more detail in primitive-based 3D reconstructions without sacrificing speed.

Jorge Condor, Nicolas Moenne-Loccoz, Merlin Nimier-David +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Mar 30, 2026

NVIDIA · Mar 30, 2026 · also ANU

4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction

Achieve 49% and 19% better Chamfer distance than state-of-the-art dynamic surface reconstruction methods on Hi4D and CMU Panoptic datasets, respectively, by enforcing temporal consistency in Gaussian Splatting.

Jose M. Alvarez

Computer Vision Robotics & Embodied AI

Mar 17, 2026

Idealworks · Mar 17, 2026 · also NVIDIA, NSFC

Industrial cuVSLAM Benchmark & Integration

A hybrid cuVSLAM-based visual SLAM system achieves superior mapping accuracy in real-world logistics environments, outperforming other VO/VSLAM approaches.

Charbel Abi Hana, Kameel Amareen, M. Mostafa +4

Computer Vision Eval Frameworks & Benchmarks Robotics & Embodied AI

Feb 17, 2026

NVIDIA · Feb 17, 2026 · also Technion

Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Forget monolithic LoRAs: LoRWeB dynamically mixes a basis set of LoRAs to unlock SOTA generalization in visual analogy tasks.

Hila Manor, Rinon Gal +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models +1

Feb 16, 2026

NVIDIA · Feb 16, 2026 · also UofT

Depth Completion as Parameter-Efficient Test-Time Adaptation

Achieve state-of-the-art depth completion by adapting 3D foundation models at test time with minimal parameter updates, outperforming task-specific encoders that often overfit.

Bingxin Ke, Jiahui Huang, Xuanchi Ren +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Feb 10, 2026

NVIDIA · Feb 10, 2026

ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop

Forget tedious manual segmentation: ArtisanGS lets you lasso objects in 3D Gaussian Splats with AI-powered 2D selections that propagate into 3D, giving you unprecedented control over editing.

Clement Fuji Tsang, Anita Hu, Or Perel +2

Computer Vision Robotics & Embodied AI Tool Use & Agents

Nov 16, 2025

NVIDIA · Nov 16, 2025 · also Technion

Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis

Current NVS evaluation metrics are misleading, so this paper introduces a task-aware framework using Zero123 features that actually aligns with human perception of quality and faithfulness.

Saar Stern, Ido Sobol, O. Litany

Computer Vision Eval Frameworks & Benchmarks

Feb 11, 2025

NVIDIA · Feb 11, 2025 · also IBM Research, Technion

Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips

Flipping just *two* sign bits in a large neural network can obliterate its performance, revealing a surprising fragility in even state-of-the-art models.

Ido Galil, M. Kimhi +3

Computer Vision Red-Teaming & Adversarial Robustness