Zero-shot multimodal information extraction models struggle in real-world scenarios that mix seen and unseen categories; this work closes the gap by modeling hierarchical semantic relationships in hyperbolic space and aligning semantic similarity distributions.
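The hyperbolic-space piece is the concrete part: tree-like label hierarchies embed with low distortion in the Poincaré ball, whose geodesic distance is cheap to compute. A minimal sketch of that distance (illustrative, not the paper's code), assuming embeddings already live inside the unit ball:

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points in the Poincare ball.

    Hierarchies embed with low distortion here because volume grows
    exponentially with radius, matching how trees branch.
    """
    sq_norm_u = np.dot(u, u)
    sq_norm_v = np.dot(v, v)
    sq_dist = np.dot(u - v, u - v)
    # Standard Poincare-ball distance formula.
    x = 1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return float(np.arccosh(x))

# Toy check: a parent near the origin sits closer to each child than
# the two children (siblings, near the boundary) sit to each other.
parent = np.array([0.05, 0.0])
child_a = np.array([0.7, 0.1])
child_b = np.array([-0.7, 0.1])
print(poincare_distance(parent, child_a))   # ~1.66
print(poincare_distance(child_a, child_b))  # ~3.51, much larger
```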
A lightweight vision-language-action (VLA) model built on deep state space models outperforms larger models at language-guided manipulation while running 3x faster.
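For context on where the speedup comes from: a linear state space model carries a constant-size recurrent state, so inference cost grows linearly in sequence length rather than quadratically as with attention. A toy sketch of the recurrence (not the paper's architecture; real deep SSM layers learn these matrices and use parallel scans):

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Minimal linear SSM: x_{t+1} = A x_t + B u_t, y_t = C x_t.

    O(T) time and O(1) state per step, versus attention's O(T^2);
    S4/Mamba-style layers stack learned, input-dependent versions
    of this same scan.
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:               # sequential form; training-time
        x = A @ x + B @ u_t     # implementations parallelize the scan
        ys.append(C @ x)
    return np.stack(ys)

# Toy run: 2-dim state, scalar input stream of length 5.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, -1.0]])
print(ssm_scan(A, B, C, np.ones((5, 1))).ravel())
```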
Nail design retrieval gets a major upgrade: NaiLIA leverages dense intent descriptions and palette queries to outperform standard methods, opening the door to more nuanced and personalized image search.
Scaling VLMs won't magically unlock reasoning skills; you need to address the reporting bias in training data that suppresses tacit information.
Unlock robot learning with hidden knowledge: TOPReward extracts surprisingly accurate task progress signals directly from VLM token probabilities, bypassing the need for explicit reward engineering.
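The paper's exact formulation isn't reproduced here, but the core trick reads well in a few lines: prompt the VLM about task completion and renormalize the next-token probabilities of the candidate answers into a scalar. A minimal sketch, assuming a yes/no completion prompt and access to token log-probabilities (the log-prob values below are made up):

```python
import math

def progress_from_token_logprobs(logprobs: dict[str, float]) -> float:
    """Turn a VLM's next-token log-probabilities into a progress score.

    `logprobs` maps candidate answer tokens (here "yes"/"no" for a
    prompt like "Is the task complete?") to their log-probabilities.
    The renormalized probability of "yes" acts as a dense reward,
    with no hand-engineered reward function.
    """
    p_yes = math.exp(logprobs["yes"])
    p_no = math.exp(logprobs["no"])
    return p_yes / (p_yes + p_no)

# Two frames of a rollout, with illustrative log-probs:
early = {"yes": -2.3, "no": -0.2}   # model doubts completion
late = {"yes": -0.1, "no": -3.0}    # model is fairly confident
print(progress_from_token_logprobs(early))  # ~0.11
print(progress_from_token_logprobs(late))   # ~0.95
```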
Achieve competitive image-text fact checking at just $0.013 per check by combining RAG with reverse image search, using a surprisingly simple and reproducible architecture.
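The architecture is simple enough to sketch end to end. Below, `reverse_image_search`, `retrieve_passages`, and `llm_judge` are hypothetical stubs standing in for whatever search API, retriever, and LLM the system actually wires together; the two-source structure, not the stubs, is the point:

```python
from dataclasses import dataclass

# Hypothetical stand-ins: a real system would call a reverse image
# search API, a text retriever, and a small LLM here.
def reverse_image_search(image_path: str, top_k: int) -> list[str]:
    return [f"[page {i} that republished {image_path}]" for i in range(top_k)]

def retrieve_passages(query: str, top_k: int) -> list[str]:
    return [f"[passage {i} retrieved for: {query}]" for i in range(top_k)]

@dataclass
class Verdict:
    label: str       # "supported" / "refuted" / "not enough evidence"
    rationale: str

def llm_judge(claim: str, evidence: list[str]) -> Verdict:
    # One cheap LLM call over the pooled evidence is what keeps the
    # quoted per-check cost low; stubbed out here.
    return Verdict("not enough evidence", f"{len(evidence)} items reviewed")

def check_claim(image_path: str, claim: str) -> Verdict:
    image_evidence = reverse_image_search(image_path, top_k=5)  # provenance
    text_evidence = retrieve_passages(claim, top_k=5)           # textual RAG
    return llm_judge(claim, image_evidence + text_evidence)

print(check_claim("photo.jpg", "This photo shows the 2024 flood in X."))
```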
Achieve scalable and consistent multi-reference image editing by dynamically serializing reference images into a coherent latent sequence, outperforming existing diffusion-based methods.
Diffusion models' spatial reasoning can be unlocked without expensive joint training: a plug-and-play framework offloads layout planning to an MLLM and lets the diffusion model render the plan.
Finally, a single 3D medical vision-language model that nails both high-level reasoning (report generation, VQA) and fine-grained segmentation from language, point, or box prompts.
Despite their promise, even the best-performing multimodal LLM tested (GPT-4o) achieves only 26% accuracy at grading knee osteoarthritis from radiographs, exposing a significant gap in clinical reliability.