Bryan Catanzaro

Audex achieves state-of-the-art audio understanding and generation while maintaining the reasoning prowess of its text-only foundation, all through a unified architecture.

Zhifeng Kong, Sang-gil Lee, JaeHyeon Kim +17

Multimodal Models · Speech & Audio

Jun 25, 2026

Fitsum Reda +5 · Jun 25, 2026 · also NVIDIA

Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

Achieving 2.42X faster generation without sacrificing quality, Nemotron-TwoTower redefines the efficiency of language modeling.

Fitsum Reda, John Kamalu, Roger Waleffe +3

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing

Jun 12, 2026

AI2 · Jun 12, 2026 · also NVIDIA, Gusu Laboratory of Materials, HKUST, NJU +5

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Achieving six times the inference throughput of current LLMs while maintaining accuracy, Nemotron 3 Ultra redefines performance benchmarks for agentic reasoning tasks.

NVIDIA, Aaron Blakeman, Aaron Thomas +554

Architecture Design (Transformers, SSMs, MoE) · Scaling Laws & Emergent Abilities · Tool Use & Agents

Apr 27, 2026

NVIDIA · Apr 27, 2026 · also Amazon Science, Microsoft Research, UW, Music X Lab +1

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.

NVIDIA, Amala Sanjay Deshmukh, K. Chumachenko, Tuomas Rintamaki +208

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Speech & Audio

Apr 13, 2026

NVIDIA · Apr 13, 2026 · also IIT Delhi, Indraprastha Institute of Information, Jaypee Institute of Information, UMD

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Audio-language models can now reason about 30-minute-long audio clips with timestamp-grounded intermediate steps, unlocking a new level of fine-grained understanding.

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar +17

Multimodal Models · Open-Source Models & Weights · Speech & Audio

Mar 19, 2026

NVIDIA · Mar 19, 2026 · also HKUST, Samsung, Waterloo

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

A 30B MoE model can now achieve Gold Medal-level performance in IMO, IOI, and ICPC, rivaling frontier models with 20x more parameters.

Zhuoling Yang, Zhuolin Yang, Yang Chen +23

Code Generation & Program Synthesis · Reasoning & Chain-of-Thought · RLHF & Preference Learning

Feb 24, 2026

NVIDIA · Feb 24, 2026

On Data Engineering for Scaling LLM Terminal Capabilities

Forget hand-crafted datasets: a new synthetic data pipeline lets smaller LLMs beat giants at terminal tasks.

Renjie Pi, Grace Lam, Mohammad Shoeybi +5

Data Curation & Synthetic Data · Tool Use & Agents · Training Efficiency & Optimization

Mar 6, 2025

NVIDIA · Mar 6, 2025

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

A 3B parameter model, Audio Flamingo 2, now rivals larger proprietary models in audio understanding and reasoning, even handling audio segments up to 5 minutes long.

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar +788

Multimodal Models · Reasoning & Chain-of-Thought · Speech & Audio


Papers (9)