Lattice: the structure behind the noise

Created by Flynn Lachendro

Zhifeng Kong

Papers on Lattice: 4
Total citations: 88
Topics: 7
h-index: 10

Publication activity (papers/week, last 8 weeks)

Research focus

Multimodal Models (4) · Speech & Audio (3) · Robotics & Embodied AI (1) · World Models & Planning (1)

Frequent co-authors

Arushi Goel (3) · Siddharth Gururani (3) · Wei Ping (3) · Sreyan Ghosh (2)

Papers (4)

Jun 1, 2026

NVIDIA · 2w ago · also Georgia Tech, HKUST, IIS Academia Sinica, JHU +6

Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3 sets a new benchmark for omnimodal models, outperforming the existing state of the art on both Text-to-Image and Image-to-Video tasks.

Aditi, Niket Agarwal, Arslan Ali +287

Multimodal Models · Robotics & Embodied AI · World Models & Planning

May 28, 2026

Tingle Li +8 · 3w ago

Benchmarking Single-Factor Physical Video-to-Audio Generation

V2A models prioritize text captions over visual cues when generating audio, resulting in physically plausible but often temporally misaligned sounds.

Tingle Li, Siddharth Gururani, Kevin J. Shih +6

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio

Apr 13, 2026

NVIDIA · Apr 13, 2026 · also IIT Delhi, Indraprastha Institute of Information, Jaypee Institute of Information, UMD

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Audio-language models can now reason about 30-minute-long audio clips with timestamp-grounded intermediate steps, unlocking a new level of fine-grained understanding.

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar +17

Multimodal Models · Open-Source Models & Weights · Speech & Audio

Mar 6, 2025

NVIDIA · Mar 6, 2025

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

Audio Flamingo 2, a 3B-parameter model, now rivals larger proprietary models in audio understanding and reasoning, and handles audio segments up to 5 minutes long.

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar +788

Multimodal Models · Reasoning & Chain-of-Thought · Speech & Audio

More research focus: Eval Frameworks & Benchmarks (1) · Open-Source Models & Weights (1)

More frequent co-authors: Kevin J. Shih (2) · Sang-gil Lee (2) · Ming-Yu Liu (2)