Search papers, labs, and topics across Lattice.
3
0
7
A 3B model can match the performance of models more than twice its size in mobile GUI automation by distilling visual history into concise natural language summaries.
Get better image captions without more data: reinforcement learning can train vision-language models to focus on image details by maximizing the similarity between images retrieved using the generated captions.
MLLMs can be significantly boosted by curriculum learning that focuses on reward design rather than data selection, dynamically weighting generalized rubrics based on the model's evolving competence.