Search papers, labs, and topics across Lattice.
3
0
5
3
Leaderboard-topping video models are still surprisingly brittle, failing on basic video reasoning tasks unless given the right textual cues.
Video-LLMs can hallucinate and perform *worse* with chain-of-thought reasoning due to "visual anchor drifting," but a simple frame repetition strategy guided by a learned scoring function can fix it.
Ditch autoregressive MLLMs: Omni-Diffusion proves that mask-based discrete diffusion models can unify multimodal understanding and generation across text, speech, and images with competitive performance.