Nemotron 3 Nano Omni is a multimodal model supporting audio, text, images, and video. It improves upon its predecessor, Nemotron Nano V2 VL, through architectural enhancements, refined training data, and optimized training recipes, and reports leading results in document understanding, long audio-video comprehension, and agentic computer use. Multimodal token-reduction techniques applied to the Nemotron 3 Nano 30B-A3B backbone give it lower inference latency and higher throughput than models of similar size.
Nemotron 3 Nano Omni combines leading accuracy on real-world tasks such as document understanding and long audio-video comprehension with substantially lower inference latency, the latter enabled by multimodal token-reduction techniques.
We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data, and training recipes. In particular, Nemotron 3 Nano Omni delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase, to facilitate further research and development.