Multi-source visual reasoning can actually *hurt* performance when modalities conflict; MARS tackles this by adaptively emphasizing mutual promotion between sources and suppressing noise, leading to significant gains.
Current MLLMs are still surprisingly reliant on textual reasoning, even when visual information is crucial for solving STEM problems.
Continual learning methods for Video-LLMs face a fundamental trade-off: mitigating catastrophic forgetting often comes at the cost of generalization or prohibitive computational overhead.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric narrows the accuracy gap to full-precision models by up to 1.6% in 3-bit weight-only quantization.
Latent visual reasoning in multimodal LLMs is largely ineffective, as the "imagination" happening in latent space doesn't actually attend to the input or influence the output, making explicit text-based imagination a surprisingly better alternative.