Forget finetuning encoders: representing human motion as structured text unlocks surprisingly strong performance on motion understanding tasks by directly leveraging LLMs' pretrained knowledge.
VLMs struggle to align assembly diagrams with videos because the two modalities occupy disjoint visual representation spaces, revealing a fundamental limitation in cross-modal understanding.