VLMs struggle to align assembly diagrams and videos because they occupy disjoint visual representation spaces, revealing a fundamental limitation in cross-modal understanding.
Shrinking a 2B vision-language retriever to a 70M text-only model achieves 95% of the original quality and outperforms a 2B baseline, while slashing query latency by 50x.
Ditch global embeddings for text-motion retrieval: this method uses joint-angle motion images and token-patch late interaction to achieve state-of-the-art accuracy and interpretability.
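The token-patch late interaction mentioned above can be sketched in a few lines. This is a generic ColBERT-style MaxSim scorer, not the paper's actual implementation: every name, dimension, and the use of random embeddings here are illustrative assumptions. Each text token is matched against its best motion-image patch, and the per-token maxima are summed into a single retrieval score, which is what makes the matching interpretable at the token level.

```python
import numpy as np

def late_interaction_score(text_tokens: np.ndarray,
                           motion_patches: np.ndarray) -> float:
    """ColBERT-style MaxSim scoring (illustrative sketch).

    text_tokens:    (T, d) L2-normalized text token embeddings
    motion_patches: (P, d) L2-normalized motion-image patch embeddings
    Returns the sum over text tokens of each token's best patch similarity.
    """
    sim = text_tokens @ motion_patches.T      # (T, P) cosine similarities
    return float(sim.max(axis=1).sum())       # per-token best match, summed

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example with random (hypothetical) embeddings.
rng = np.random.default_rng(0)
q = l2norm(rng.normal(size=(4, 8)))    # 4 text tokens, dim 8
p = l2norm(rng.normal(size=(16, 8)))   # 16 motion-image patches, dim 8
score = late_interaction_score(q, p)
print(score)
```

Because each text token contributes one identifiable best-matching patch, inspecting the argmax per token yields the token-to-patch alignments that global single-vector embeddings cannot provide.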