Robots can now "see" hidden objects and understand articulation by learning from human egocentric video, even when they can't physically explore those occluded regions themselves.
Video generative models already contain powerful image restoration priors and can be coaxed into state-of-the-art restoration performance with just 1,000 training examples.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends to only the most informative patches.
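As a rough intuition for how attending to "only the most informative patches" buys such speedups, here is a minimal, hypothetical sketch: score each visual token against the text query and keep only the top fraction before the LLM sees them. The scoring rule, `keep_ratio`, and function names are assumptions for illustration, not AutoGaze's published method.

```python
# Hypothetical top-k patch selection: keep only query-relevant visual tokens.
import torch

def select_informative_patches(patch_feats, query_feat, keep_ratio=0.01):
    """patch_feats: (N, d) visual tokens; query_feat: (d,) pooled text query."""
    scores = patch_feats @ query_feat                   # relevance score per patch
    k = max(1, int(keep_ratio * patch_feats.shape[0]))  # keeping ~1% -> ~100x fewer tokens
    top = torch.topk(scores, k).indices                 # indices of the k best patches
    return patch_feats[top], top

patches = torch.randn(100_000, 768)   # stand-in for a 4K video's patch tokens
query = torch.randn(768)
kept, idx = select_informative_patches(patches, query)
print(kept.shape)                     # torch.Size([1000, 768])
```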
Zero-shot robotic manipulation is now within reach: TiPToP matches a 350-hour fine-tuned model without *any* robot data.
By dynamically adjusting contrastive learning temperatures based on data density, MM-TS achieves state-of-the-art results on multimodal long-tail datasets.
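To make the "density-adjusted temperature" idea concrete, here is a hedged InfoNCE sketch: samples in dense (head) regions get a sharper temperature, sparse (tail) samples a softer one. The density proxy (mean k-NN similarity) and the linear scaling are illustrative assumptions, not MM-TS's actual formula.

```python
# Sketch of density-adaptive temperature for a cross-modal contrastive loss.
import torch
import torch.nn.functional as F

def adaptive_temperature_infonce(z_a, z_b, t_min=0.05, t_max=0.2, k=10):
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    sim = z_a @ z_b.T                                        # (B, B) cross-modal similarities
    # density proxy: mean similarity to the k nearest in-modality neighbours
    knn = torch.topk(z_a @ z_a.T, k + 1, dim=1).values[:, 1:].mean(dim=1)
    dens = (knn - knn.min()) / (knn.max() - knn.min() + 1e-8)
    tau = t_max - dens * (t_max - t_min)                     # dense -> t_min, sparse -> t_max
    logits = sim / tau.unsqueeze(1)                          # per-sample temperature
    targets = torch.arange(z_a.shape[0])
    return F.cross_entropy(logits, targets)

loss = adaptive_temperature_infonce(torch.randn(256, 512), torch.randn(256, 512))
```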
Forget hand-engineered features: this approach learns symbolic representations for robotic planning directly from pixels using VLMs, enabling impressive zero-shot generalization to new environments and goals.
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Independently trained multimodal models like CLIP aren't so independent after all: a single orthogonal transformation can align their embedding spaces across both image and text modalities.
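The "single orthogonal transformation" claim maps naturally onto the classic orthogonal Procrustes problem; the sketch below shows that framing on synthetic paired embeddings. The shapes, data, and the assumption that Procrustes is the right solver here are illustrative, not taken from the paper.

```python
# Orthogonal Procrustes: find orthogonal R minimizing ||X R - Y||_F via SVD,
# then reuse the single rotation for both image and text embeddings.
import numpy as np

def orthogonal_alignment(X, Y):
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
R_true = np.linalg.qr(rng.normal(size=(512, 512)))[0]   # synthetic ground-truth rotation
X = rng.normal(size=(1000, 512))                        # embeddings from model A
Y = X @ R_true                                          # "model B" embeddings
R = orthogonal_alignment(X, Y)
print(np.allclose(X @ R, Y))                            # True: spaces aligned
```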
VLMs can be easily swayed by subtle, optimized visual prompts, revealing vulnerabilities in their decision-making processes that could be exploited in real-world applications.
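For readers unfamiliar with how such "subtle, optimized visual prompts" are made, here is a generic PGD-style loop: gradient steps on a small, norm-bounded perturbation that pushes the model toward an attacker-chosen output. This is a textbook pattern under assumed `model` and `target_loss_fn` callables, not the paper's specific attack.

```python
# Generic projected-gradient attack: optimize a bounded image perturbation.
import torch

def pgd_visual_prompt(model, image, target_loss_fn, eps=8/255, steps=100, lr=1/255):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = target_loss_fn(model(image + delta))   # how far from the attacker's target
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()           # step toward the target output
            delta.clamp_(-eps, eps)                   # keep the perturbation subtle
            delta.grad.zero_()
    return (image + delta).detach()
```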
By cleverly repurposing text-to-video diffusion models, VideoSketcher achieves high-quality sequential sketch generation from extremely limited human-drawn sketch data.
Injecting spatial transcriptomics data into existing pathology foundation models unlocks significant performance gains across a range of downstream tasks, including molecular status prediction and gene-to-image retrieval.
Quadrupedal robots can now nimbly navigate stairs and rough terrain thanks to a new multimodal RL approach that doesn't require probing the terrain with their front feet.
Forget expensive human annotation: this dual-loop method automatically cleans remote sensing image-text datasets, boosting T2I model performance by over 35%.
Forget hand-annotated data: ChartGen automatically generates 222.5K chart-image/code pairs, exposing surprising weaknesses in today's VLMs at reconstructing plotting scripts.
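A toy sketch of what generating a "chart-image/code pair" can look like: emit a plotting script as text, execute it to render the image, and store the two together as one training example. The template and file naming are assumptions; ChartGen's real pipeline is far richer.

```python
# Toy chart-image/code pair generator: the stored code is exactly what
# rendered the stored image.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering

def make_pair(seed, out_png):
    rng = np.random.default_rng(seed)
    y = rng.integers(1, 10, size=5).tolist()
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar(range(5), {y})\n"
        f"plt.savefig({out_png!r})\n"
    )
    exec(code, {})           # render the image from the very code we store
    return out_png, code

img_path, plot_code = make_pair(0, "chart_0000.png")
```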
Achieving 80% accuracy on VQA v2.0 shows that combining Visual BERT, ViLT, and memory-augmented attention can significantly outperform traditional VQA models.