Search papers, labs, and topics across Lattice.
State Key Laboratory of Multimodal Artificial Intelligence Systems, University of Chinese Academy of Sciences, ShanghaiTech University
4
0
7
Ditch the fragmented architectures: OneDrive unifies autonomous driving tasks within a single VLM decoder, achieving state-of-the-art performance while slashing latency.
Multimodal trackers can achieve state-of-the-art results without ballooning parameter counts by adaptively aligning cross-modal attention maps and using hierarchical mixture of experts for efficient global reasoning.
Current image quality metrics struggle to articulate *why* one high-quality image is better than another, but this challenge shows MLLMs are closing the gap by providing expert-level explanations.
MLLMs can be tricked into missing 90% of harmful content simply by encoding it in images that humans can easily read.