Search papers, labs, and topics across Lattice.
2
0
5
MLLMs can "think" with images, but their actions often don't match their reasoning, and this paper solves that with a new training method that forces them to explain what they see.
Get better image captions without more data: reinforcement learning can train vision-language models to focus on image details by maximizing the similarity between images retrieved using the generated captions.