Sreyan Ghosh

Audex achieves state-of-the-art audio understanding and generation while maintaining the reasoning prowess of its text-only foundation, all through a unified architecture.

Zhifeng Kong, Sang-gil Lee, JaeHyeon Kim +17

Multimodal Models Speech & Audio

Jun 16, 2026

A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models

LALMs can boost their temporal reasoning accuracy by 3.2% simply by better redistributing attention across audio tokens rather than relying on textual cues.

Apoorva Kulkarni, Kaousheik Jayakumar, Sreyan Ghosh +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Speech & Audio

NVIDIA · Jun 1, 2026 · also BAIR, Galbot, Georgia Tech, HKUST +9

Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3 sets a new benchmark for omnimodal models, outperforming the existing state of the art in both Text-to-Image and Image-to-Video tasks.

Aditi, Niket Agarwal, Arslan Ali +285

Multimodal Models Robotics & Embodied AI World Models & Planning

NVIDIA · Apr 27, 2026 · also Amazon Science, Microsoft Research, UW, Music X Lab +1

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.

NVIDIA, Amala Sanjay Deshmukh, K. Chumachenko, Tuomas Rintamaki +208

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

Apr 19, 2026 · also Dolby Laboratories

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

Generate semantically aligned, high-fidelity music for videos with unprecedented speed and control by combining autoregressive planning and diffusion.

Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj +5

Multimodal Models Speech & Audio

NVIDIA · Apr 13, 2026 · also IIT Delhi, Indraprastha Institute of Information, Jaypee Institute of Information, UMD

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Audio-language models can now reason about 30-minute-long audio clips with timestamp-grounded intermediate steps, unlocking a new level of fine-grained understanding.

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar +17

Multimodal Models Open-Source Models & Weights Speech & Audio

Ramaneswaran Selvakumar +5 · Apr 3, 2026

Do Audio-Visual Large Language Models Really See and Hear?

AVLLMs may "hear" at intermediate layers, but they largely ignore audio cues in favor of vision when generating text, revealing a fundamental modality bias.

Ramaneswaran Selvakumar, Kaousheik Jayakumar, S. Sakshi +3

Interpretability & Mechanistic Interp Multimodal Models Speech & Audio


Papers (8)