Vision-language-action models (VLAs) aren't just memorizing training data; sparse autoencoders reveal a hidden layer of generalizable motion primitives that can be steered to control robot behavior across tasks.
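A minimal sketch of that recipe, assuming a PyTorch policy whose hidden states we probe: fit a sparse autoencoder to layer activations, then add one learned dictionary direction back in at inference to bias behavior. The dimensions, feature index, and steering rule below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder for probing policy activations.
    Sizes are placeholders, not taken from the paper."""
    def __init__(self, d_model=1024, d_dict=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, h):
        z = torch.relu(self.enc(h))   # sparse codes over learned features
        return self.dec(z), z

def steer(h, sae, feature_idx, alpha=4.0):
    """Add one dictionary direction to a hidden state, e.g. a feature
    that fires for a specific motion primitive, to bias behavior."""
    direction = sae.dec.weight[:, feature_idx]   # shape: (d_model,)
    return h + alpha * direction
```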
RADAR offers a scalable, interpretable framework for understanding robot policy generalization by directly linking test-time performance to the training data, revealing the specific types of generalization required.
Current Large Audio Language Models (LALMs) excel at speech recognition yet struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance.
Forget retraining: you can steer a robot's behavior in real-time by nudging its internal representations with lightweight, targeted interventions.
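One common form such a lightweight intervention takes is activation addition: compute a steering vector from contrasting contexts and inject it through a forward hook, leaving the weights frozen. The layer choice, scale, and helper names below are assumptions for illustration, not the paper's exact method.

```python
import functools
import torch

def steering_vector(acts_pos, acts_neg):
    """Difference-of-means steering vector: mean activation under
    contexts showing the target behavior minus the contrast contexts."""
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def add_vector_hook(module, inputs, output, v, alpha=2.0):
    # Forward hook: nudge the layer output at inference time.
    # No gradients, no retraining; the base policy stays frozen.
    return output + alpha * v

# Hypothetical usage: attach to one transformer block of the policy.
# v = steering_vector(acts_pos, acts_neg)
# handle = model.blocks[12].register_forward_hook(
#     functools.partial(add_vector_hook, v=v))
```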
Robots can now remember what they've done and what they need to do next for 15 minutes straight, thanks to a new memory architecture that mixes video and text.
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Turns out, the best memory design for robotic manipulation depends heavily on the task, with no single architecture dominating across the board.
Forget OCR: powerful MLLMs can extract information from business documents just as accurately from raw images alone, challenging the necessity of traditional OCR pipelines.
Generate minute-long videos with compelling narrative structure and local realism, even with limited long-form training data, by cleverly combining supervised flow matching for global coherence with mode-seeking alignment to a short-video teacher for local fidelity.
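A sketch of how those two objectives might sit side by side, under the standard conditional flow-matching setup. The teacher-alignment term here is an MSE placeholder marking where the paper's mode-seeking divergence would plug in, not its actual loss.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """Supervised conditional flow matching on the (scarce) long-form
    clips: regress the velocity of the path x_t = (1 - t) x0 + t x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1
    return F.mse_loss(model(xt, t, cond), x1 - x0)

def local_alignment_loss(student_v, teacher_v):
    """Stand-in for mode-seeking alignment to the short-video teacher
    on local windows; the real objective (e.g. a reverse-KL-style
    distillation) is more involved than this MSE surrogate."""
    return F.mse_loss(student_v, teacher_v.detach())
```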
By unifying hand motion estimation and generation into a single diffusion framework, UniHand handles heterogeneous inputs and challenging conditions like occlusions better than task-specific models.
XR gets real: control virtual worlds with your head and hands, not just text prompts.
Achieve spatially faithful image-to-image translation without cross-domain supervision by bridging diffusion models with self-supervised semantic representations.
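One plausible instantiation of that bridge is feature-space guidance during sampling: at each step, nudge the sample so its denoised estimate stays close to the source image in a self-supervised embedding (e.g. DINO features). Everything named below (`denoiser`, `encoder`, the guidance scale) is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def semantic_guidance_step(x_t, x_src, denoiser, encoder, t, scale=1.0):
    """One guided sampling step, sketched: penalize drift between the
    denoised estimate and the source image in feature space, which
    preserves spatial layout without any cross-domain pairs."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                        # predicted clean image
    loss = F.mse_loss(encoder(x0_hat), encoder(x_src))
    grad, = torch.autograd.grad(loss, x_t)
    return (x_t - scale * grad).detach()             # nudge before next step
```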
Verification at test time can be a surprisingly effective alternative to scaling policy learning for vision-language-action alignment, yielding substantial gains in both simulated and real-world robotic tasks.
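The simplest version of test-time verification is best-of-N: sample several candidate actions from the frozen policy and let a learned verifier pick one. A sketch with assumed interfaces (`policy.sample` and `verifier` are illustrative, not the paper's API):

```python
import torch

def act_with_verifier(policy, verifier, obs, instruction, n=16):
    """Best-of-N at inference: draw candidate action chunks from the
    frozen policy and execute the one the verifier scores highest."""
    candidates = [policy.sample(obs, instruction) for _ in range(n)]
    scores = torch.stack([verifier(obs, instruction, a) for a in candidates])
    return candidates[int(scores.argmax())]
```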
Closing the reality gap: iteratively refining a world model with real-world robot data yields a significant boost in vision-language-action policy performance.
You can now detect harmful memes with 17% better accuracy and understand *why* they're toxic, thanks to a new framework that injects cultural context and explains its reasoning.
A unified Vision-Language Model and Diffusion architecture unlocks surprisingly effective optical flow forecasting from noisy web data, enabling language-conditioned robot control and video generation.
An end-to-end learned robotic system can now clean your kitchen in a completely new house, thanks to a novel co-training approach on diverse data.