Department of Automation, University of Science and Technology of China, Hefei, China
Freezing most of a vision-language model's weights and adapting it only with LoRA achieves near state-of-the-art performance on multimodal interleaved reasoning, suggesting that targeted adaptation can rival full fine-tuning.
Vision-and-language navigation (VLN) agents can navigate more effectively by learning commonsense relationships between rooms and landmarks, thanks to a new method that injects knowledge from ChatGPT, BLIP-2, and Stable Diffusion.