Existing zero-shot multimodal information extraction models struggle in real-world scenarios that mix seen and unseen categories; this work addresses the problem by modeling hierarchical semantic relationships in hyperbolic space and aligning semantic similarity distributions.
Nail design retrieval gets a major upgrade: NaiLIA leverages dense intent descriptions and palette queries to outperform standard methods, opening the door to more nuanced and personalized image search.
Achieve centimeter-level precision in robot trajectory generation without sacrificing kinematic smoothness by fusing relational and temporal inductive biases within a diffusion framework.
Ditch slow, external segmentation pipelines: TrajTok learns trajectory tokens end-to-end, boosting video understanding while staying lean and adaptable.
Achieve scalable and consistent multi-reference image editing by dynamically serializing reference images into a coherent latent sequence, outperforming existing diffusion-based methods.
Unlock the spatial reasoning potential of diffusion models without expensive joint training, using a plug-and-play framework that leverages MLLMs for layout planning.
Despite the promise of multimodal LLMs, even the best one (GPT-4o) achieves only 26% accuracy in grading knee osteoarthritis from radiographs, revealing a significant gap in clinical reliability.