University of Chinese Academy of Sciences
Current MLLMs are still surprisingly reliant on textual reasoning, even when visual information is crucial for solving STEM problems.
Continual learning methods for Video-LLMs face a fundamental trade-off: mitigating catastrophic forgetting often comes at the cost of generalization or prohibitive computational overhead.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
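The blurb above doesn't spell out how the token-level sensitivity metric is defined, so the following is only a minimal sketch of the general idea: using calibration activations to weight the rounding error when choosing per-channel weight-quantization scales. The function name, the use of mean squared activations as the sensitivity proxy, and the scale search grid are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: "sensitivity" here is per-input-channel activation
# energy from calibration tokens, NOT the metric defined in the paper above.
import numpy as np

def quantize_weight_sensitivity_aware(W, X_calib, n_bits=3, grid=None):
    """Pick per-output-channel scales that minimize a sensitivity-weighted
    rounding error for weight matrix W (out_features x in_features).

    W       : (out, in) float weights
    X_calib : (tokens, in) calibration activations feeding this layer
    """
    # Per-input-channel sensitivity proxy: mean squared activation over tokens.
    sens = (X_calib ** 2).mean(axis=0)                    # (in,)
    qmax = 2 ** (n_bits - 1) - 1                          # symmetric integer grid
    if grid is None:
        grid = np.linspace(0.5, 1.2, 15)                  # candidate scale shrink factors

    W_q = np.empty_like(W)
    for o in range(W.shape[0]):
        w = W[o]
        best_err, best_wq = np.inf, None
        for g in grid:
            scale = g * np.abs(w).max() / qmax
            q = np.clip(np.round(w / scale), -qmax - 1, qmax)
            w_hat = q * scale
            # Sensitivity-weighted reconstruction error instead of plain MSE.
            err = (sens * (w - w_hat) ** 2).sum()
            if err < best_err:
                best_err, best_wq = err, w_hat
        W_q[o] = best_wq
    return W_q

# Tiny usage example with random data.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)
X = rng.normal(size=(32, 16)).astype(np.float32)
W3 = quantize_weight_sensitivity_aware(W, X, n_bits=3)
print("mean |W - W3|:", np.abs(W - W3).mean())
```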
Latent visual reasoning in multimodal LLMs is largely ineffective: the "imagination" happening in latent space neither attends to the input nor influences the output, making explicit text-based imagination a surprisingly better alternative.
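As a purely illustrative diagnostic (not the analysis from the paper above), the snippet below shows one way to quantify the first part of that claim: given row-stochastic attention weights and assumed index ranges for image-patch tokens and latent "imagination" slots, it measures how much of the latent tokens' attention mass actually lands on the visual input. All names and index ranges are hypothetical.

```python
# Illustrative diagnostic only: how much do latent "imagination" tokens
# attend to the visual input, averaged over heads and latent positions?
import numpy as np

def latent_to_visual_attention_mass(attn, visual_idx, latent_idx):
    """attn: (heads, seq, seq) row-stochastic attention weights."""
    rows = attn[:, latent_idx, :]                          # attention paid BY latent tokens
    mass_on_visual = rows[:, :, visual_idx].sum(axis=-1)   # (heads, n_latent)
    return mass_on_visual.mean()

# Toy example with random attention weights.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 24, 24))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
visual_idx = np.arange(0, 8)     # assumed: first 8 positions are image patches
latent_idx = np.arange(20, 24)   # assumed: last 4 positions are latent reasoning slots
print("latent-to-visual attention mass:",
      latent_to_visual_attention_mass(attn, visual_idx, latent_idx))
```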