Beihang University Zhongguancun Academy, Beihang University
Pruning 90% of visual tokens without sacrificing performance could revolutionize the efficiency of 3D scene understanding in multimodal models.
Robots can now perform intricate assembly tasks and recover from errors in real time, without any training, by fusing vision-language models with video-based kinematic priors for action planning.