Search papers, labs, and topics across Lattice.
3
66
8
2
Ditch the coordinate system: VLMs can point *way* better by directly selecting visual tokens, leading to SOTA results and improved sample efficiency.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
Robot foundation models can achieve state-of-the-art performance by explicitly reasoning about spatial plans as editable trajectory traces, rather than directly mapping perception to control.