Quadrupedal robots can now perform dynamic loco-manipulation in the real world, matching human teleoperation, using only onboard egocentric vision and a low-frequency (5 Hz) open-vocabulary detector.
Embodied agents can now exhibit coherent, long-horizon, self-directed behavior by reasoning about abstract value trade-offs, a capability previously absent in instruction-following or needs-driven approaches.
Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.
By pretraining a VLA model with goal-conditioned RL, PRTS learns to reason about goal reachability, leading to substantial gains in long-horizon robotic tasks and zero-shot generalization.
Achieve real-time robotic action with 79-91% success while generating high-fidelity 4D reconstructions, all within a single unified world model.
Robots can now navigate complex outdoor environments using only high-level human instructions and readily available GPS/map data, bypassing the need for expensive HD maps or limited short-horizon policies.
Imagine specifying complex 3D articulations with just a few 2D sketches – Sketch2Arti makes it a reality.
LLMs can now generate driving rules from traffic laws with significantly improved accuracy by grounding their reasoning in structured traffic scenarios.
Autonomous vehicles can now plan trajectories 10x faster without sacrificing performance, thanks to a novel architecture that learns complex driving behaviors in latent space during training.
MLLMs often *hallucinate* the referent of a pointing gesture, latching onto nearby or salient objects instead of truly understanding spatial semantics.
Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.
Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.
Achieve superhuman dexterity: ALAS unlocks robust long-horizon task completion by decoupling environment understanding from motor control, enabling generalization across diverse human-scene interaction scenarios.
Pocket-sized VLA models can now achieve state-of-the-art robot manipulation performance by pre-training on a curated multimodal dataset and injecting manipulation-relevant representations into the action space.
Seemingly impressive VLA performance on robotic benchmarks crumbles when stress-tested with causal interventions, exposing a reliance on brittle shortcuts rather than genuine embodied reasoning.
A custom-designed tendon-driven wrist, combined with a particle-spring model, enables precise and robust control of highly flexible objects like spinning handkerchiefs.
Time-to-collision metrics miss critical collision risk information, but a new 2D acceleration-based metric anticipates collisions far better.
VLAs can learn to adapt to new environments at test time without any fine-tuning, achieving significant performance gains on robotic manipulation and Atari games.
Imagine automating the tedious engineering tasks in embodied AI development with a conversational agent, freeing researchers to focus on core algorithmic innovation.
Achieve superior 3D scene reconstruction from aerial images with significantly reduced transmission overhead by directly optimizing communication for rendering quality.
Unlock zero-shot generalization in robot manipulation by generating diverse, affordance-aware training data with 3D generative models and Vision Foundation Models.
Robots can now focus on the *right* body parts for interaction, thanks to a new vision-language model that understands human motion commands and precisely localizes task-relevant 3D keypoints.
Robots can now better assemble boxes in the real world thanks to a video-generative value model that anticipates future states, moving beyond static snapshots for more reliable task progress assessment.
World models are more valuable for synthesizing structured supervision for navigation learning than for directly providing action-ready imagined evidence.
Ditch the slow per-scene optimization: SurfelSplat reconstructs surfaces from sparse views in under a second, matching state-of-the-art accuracy with a 100x speedup.
Hierarchical RL can tame the curse of dimensionality in fleet management, enabling superior maintenance and logistics decisions compared to monolithic approaches.
Synthesizing novel views from extrapolated poses no longer requires dense supervision, thanks to a geometry-conditioned diffusion model that explicitly learns to handle out-of-trajectory artifacts.
Generating coordinated bimanual grasps on diverse objects is now possible thanks to a dataset of nearly 10 million grasps and a model that adapts to object geometry and size.
Legged robots can now recover from sensor noise and crazy user commands with 10x greater reliability, thanks to a new method that respects the robot's competence boundaries.
Overcoming infrastructure limitations, not algorithmic capability, is the key to unlocking the potential of Embodied AI for Science in the Global South.
VLA models, seemingly robust, crumble when faced with diverse linguistic variations, as a new red-teaming approach reveals a staggering drop in task success from 93% to just 6%.
Achieve state-of-the-art 3D object detection in adverse weather by adaptively routing between LiDAR, radar, and fused features based on learned weather conditions.
Unlock the power of RL for PID control: this method automatically translates complex RL policies into simple, robust PID gains, offering a plug-and-play upgrade for existing automation systems.
Frontier video models like Veo-3 can generate surprisingly good task-level plans for robot manipulation, but still need help with the fine details.
Finally, underwater SLAM can produce photorealistic maps thanks to a novel medium-aware Gaussian map representation.
Humanoid robots can now traverse complex terrains with human-like gaits, thanks to a surprisingly simple and efficient framework that eschews adversarial training.
Animate 3D characters using bananas and plush toys – DancingBox turns everyday objects into motion capture proxies, making animation accessible to novices.
By aligning image and LiDAR features to event-derived spatiotemporal edges, $x^2$-Fusion achieves state-of-the-art accuracy in optical and scene flow estimation, particularly under challenging conditions where other multimodal fusion methods falter.
Forget training separate policies for every robot hand – this method learns one policy to control them all, slashing data needs and boosting performance by 50% in cross-embodiment manipulation.
Achieve 92% accuracy in identifying who's commanding a robot from 34 meters away by fusing IMU and camera data, a 48% leap over prior art.
Current embodied AI agents falter when faced with the multi-floor complexity of MANSION, a new language-driven framework for generating realistic, building-scale 3D environments.
RAMBO's instability got you down? ROMI offers a robust, value-aware model learning approach with implicitly differentiable adaptive weighting that outperforms RAMBO and other SOTA methods in offline RL benchmarks.
Ditch the optimization: MoRe achieves real-time 4D scene reconstruction from monocular video using a feedforward transformer that disentangles motion and structure.
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Diffusion planners get a boost in robustness and performance thanks to SAGE, a self-supervised method that weeds out dynamically inconsistent plans using a learned latent consistency signal.
Achieve state-of-the-art monocular re-localization in OpenStreetMap by cleverly aligning image semantics with map data, enabling faster and more accurate localization than dense matching approaches.
Achieve real-time, drift-free online 3D reconstruction by decoupling memory into actively refreshed local geometry and a stable, persistent global structure.
Achieve more realistic and physically plausible scene reconstructions from video by explicitly optimizing viewpoints for object generation and synthesizing scene graphs within a 3D simulator.
LLMs can now handle autonomous driving tasks with greater precision and efficiency thanks to DriveCode, which replaces discrete number tokens with continuous embeddings.
Achieve both long-term scene consistency and precise camera control in world models with UCM, a novel framework sidestepping explicit 3D reconstruction.