University of Science and Technology of China
Training with local visual cues can dramatically enhance MLLMs' ability to extract fine-grained visual details without altering their inference interface.
Video-LLMs can hallucinate and perform *worse* with chain-of-thought reasoning due to "visual anchor drifting," but a simple frame repetition strategy guided by a learned scoring function can fix it.