The Chinese University of Hong Kong
Omni-modal LLMs can ace captioning and QA, but the AVID benchmark reveals they are surprisingly poor at spotting audio-visual inconsistencies in videos, a crucial skill for trustworthy AI.
CLIP can now understand "no dog" without any fine-tuning, thanks to a plug-and-play module that disentangles negated semantics and penalizes false-positive matches.