ERNIE Team
Keyframe-residual captioning unlocks high-fidelity video-language supervision, surpassing direct VLM captioning in capturing fine-grained visual details.
LLaVA-OV-2's codec-stream tokenization lets it outperform existing video-language models, especially on tasks requiring fine-grained temporal understanding of high-frequency motion.
A 1.7B-parameter model can now rival much larger audio language models, thanks to a novel architecture and data synthesis pipeline.