Shihao Wang

Ditch slow, token-by-token box generation: LocateAnything's Parallel Box Decoding (PBD) boosts VLM grounding speed and accuracy by decoding entire bounding boxes at once.

Shihao Wang, Shilong Liu, Yu Kuang +11

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Multimodal Models

NVIDIA · Apr 27, 2026 · also Amazon Science, Microsoft Research, UW, Music X Lab +1

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.

NVIDIA: Amala Sanjay Deshmukh, K. Chumachenko, Tuomas Rintamaki +208

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Speech & Audio

NVIDIA · Mar 5, 2026 · also AgiBot, Shanghai AI Lab

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Current multimodal LLMs choke on long-form video understanding, either forgetting details or getting lost in the timeline, but a new agentic architecture with dynamic memory management offers a promising fix.

Guo Chen, Lidong Lu, Yicheng Liu +19

Eval Frameworks & Benchmarks · Multimodal Models · Tool Use & Agents

Papers (4)