Huijia Zhu

Omni-modal LLMs can ace captioning and QA, but AVID reveals they're surprisingly bad at spotting audio-visual inconsistencies in videos, a crucial skill for trustworthy AI.

Zixuan Chen, Depeng Wang, Hao Lin +6

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio

Feb 12, 2026

Feb 12, 2026 · also Ant Group, HIT, USC, ZJU

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Ditch the slow, iterative zooming during MLLM inference: Region-to-Image Distillation lets you bake those agentic zooming benefits directly into a single forward pass.

Lai Wei, Liangbo He, Jun Lan +9

Computer Vision · Inference & Quantization · Multimodal Models


Publication activity (papers/week, last 8 weeks)


Papers (3)