21 papers from NVIDIA Research on Architecture Design (Transformers, SSMs, MoE)
Stop wasting precious GPU memory: this new cache-semantic hash table library achieves up to 3.9 billion key-value lookups per second, outperforming standard approaches by up to 9.4x.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends to only the most informative patches.
Training trillion-parameter Mixture-of-Experts models just got a whole lot faster: Megatron Core now sustains >1 PFLOP/s per GPU on NVIDIA's latest hardware.
Text-to-video generation gets a 1.58x speed boost with CalibAtt, a training-free method that exploits consistent sparsity patterns in attention layers.
Forget scaling compute: the future of AI hinges on a 1000x leap in energy efficiency via tight AI+Hardware co-design over the next decade.
Forget text prompts: vector prompt interfaces are the key to unlocking scalable and stable LLM customization.
Achieve state-of-the-art results in high-resolution video geometry estimation by disentangling global coherence and fine detail using a dual-stream transformer architecture.
Representing tensor layouts with a hierarchical algebra unlocks powerful compile-time reasoning and simplifies the expression of tiling/partitioning patterns for specialized hardware.
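To make the layout-algebra idea concrete: a layout can be modeled as a (shape, stride) pair mapping a logical coordinate to a linear memory index, and hierarchy arises by composing an outer tile grid with an inner per-tile layout. The sketch below is a toy illustration in that spirit (similar to CuTe-style layouts); the function names and the 4x4 tiling example are assumptions for illustration, not the paper's notation.

```python
def index(shape, stride, coord):
    """Linear memory index of `coord` under a (shape, stride) layout:
    the dot product of the coordinate with the strides."""
    assert all(0 <= c < s for c, s in zip(coord, shape))
    return sum(c * st for c, st in zip(coord, stride))

def tiled_index(coord, tile=(4, 4), tiles=(2, 2)):
    """Hierarchical layout for an 8x8 matrix stored as a 2x2 grid of
    contiguous 4x4 tiles: compose an outer row-major layout over tiles
    with an inner row-major layout within each tile."""
    i, j = coord
    outer = (i // tile[0], j // tile[1])  # which tile the element is in
    inner = (i % tile[0], j % tile[1])    # position inside that tile
    tile_size = tile[0] * tile[1]
    outer_idx = index(tiles, (tiles[1], 1), outer) * tile_size
    inner_idx = index(tile, (tile[1], 1), inner)
    return outer_idx + inner_idx

# A plain 8x8 row-major layout is the degenerate case:
# index((8, 8), (8, 1), (2, 3)) -> 19
```

Because both levels are themselves (shape, stride) layouts, the composition can be reasoned about symbolically, which is what makes compile-time manipulation of tiling/partitioning patterns tractable.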
By explicitly guiding attention with predicted action sequences, AGA overcomes the limitations of standard dot-product attention in video action anticipation, leading to better generalization and interpretability.
Ditch quadratic scaling in 3D reconstruction: VGG-T$^3$ achieves linear scaling and an 11.6x speed-up by distilling scene geometry into a fixed-size MLP.
By pausing to "think" with latent diffusion, STAR-LDM achieves superior language understanding, narrative coherence, and controllable generation compared to standard autoregressive models of similar size.
Test-time training with KV binding isn't memorization, it's secretly a learned linear attention mechanism, unlocking architectural simplifications and parallelization.
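The equivalence behind this headline can be sketched generically: binding each key to its value via an outer product and accumulating the result is exactly causal linear attention, since querying the running state $S_t = \sum_{s \le t} k_s v_s^\top$ gives $q_t S_t = \sum_{s \le t} (q_t \cdot k_s)\, v_s$. The code below shows that generic mechanism, not the paper's exact formulation.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Causal linear attention via a running key-value "binding" state:
    S_t = sum_{s<=t} outer(k_s, v_s), output_t = q_t @ S_t.
    Equivalent to tril(Q @ K.T) @ V, but computable recurrently."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.zeros_like(V)
    for t in range(T):
        S += np.outer(K[t], V[t])  # bind key t to value t
        out[t] = Q[t] @ S          # query the accumulated state
    return out
```

The recurrent form explains the parallelization angle: the per-step update is a simple associative accumulation, so it can be chunked or scanned rather than run strictly sequentially.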
Time series generation can be dramatically improved by explicitly conditioning on semantic understanding, as demonstrated by a novel vision-centric framework.
Unlock the potential of Kolmogorov-Arnold Networks with WS-KAN, a weight-space architecture that understands their hidden symmetries and predicts their performance far better than generic methods.
Forget monolithic LoRAs: LoRWeB dynamically mixes a basis set of LoRAs to unlock SOTA generalization in visual analogy tasks.
Achieve state-of-the-art depth completion by adapting 3D foundation models at test time with minimal parameter updates, outperforming task-specific encoders that often overfit.
Diffusion models can now solve notoriously ill-posed inverse problems in carbon capture and storage, outperforming standard methods by an order of magnitude and rivaling asymptotically exact methods like rejection sampling while delivering better physical realism.
Forget fixed masking ratios: this new self-supervised learning approach for time-series data dynamically adjusts noise levels to extract richer, more versatile representations.
You can slash LLM inference costs without sacrificing quality by strategically pruning experts, quantizing, and swapping full attention for windowed attention, as demonstrated on gpt-oss-120B.
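Of the three levers mentioned, windowed attention is the easiest to see in isolation: each token attends only to the last `window` positions, cutting per-token cost from O(T) to O(window). The sketch below is an illustrative NumPy reference, not the gpt-oss-120B implementation; `window` and the shapes are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def windowed_attention(Q, K, V, window=4):
    """Causal attention restricted to a sliding window: token t attends
    only to positions [t - window + 1, t], so compute and KV-cache size
    per token are bounded by `window` instead of the sequence length."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = Q[t] @ K[lo:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ V[lo:t + 1]
    return out
```

With `window >= T` this reduces to full causal attention, which is why swapping it in layer-by-layer (as the headline suggests) can preserve quality while shrinking the KV cache.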
By decoupling MLLM instruction tuning from DiT alignment, DuoGen achieves state-of-the-art interleaved multimodal generation without costly unimodal pretraining.
Ditch the clunky pipelines: SongGen generates complete songs from text in a single pass, offering unprecedented control over musical elements and voice cloning.