StreamingVLA achieves a remarkable 2.4x speedup and 6.5x reduction in execution halting by asynchronously parallelizing observation, action generation, and execution stages in vision-language-action models.
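For intuition, here is a minimal Python sketch of the stage-pipelining idea (stand-in stages and timings, not StreamingVLA's implementation): three workers hand off through bounded queues, so action generation for the next step overlaps with execution of the current one instead of blocking on it.

```python
# Minimal sketch of asynchronous stage pipelining (illustrative only).
# Three stages run concurrently and hand off work through bounded queues.
import queue, threading, time

obs_q, act_q = queue.Queue(maxsize=1), queue.Queue(maxsize=1)

def observe():
    for t in range(5):
        time.sleep(0.01)          # stand-in for camera/proprioception capture
        obs_q.put(("obs", t))
    obs_q.put(None)               # sentinel: no more observations

def generate():
    while (obs := obs_q.get()) is not None:
        time.sleep(0.03)          # stand-in for the VLA forward pass
        act_q.put(("action_chunk", obs[1]))
    act_q.put(None)

def execute():
    while (act := act_q.get()) is not None:
        time.sleep(0.02)          # stand-in for sending commands to the robot
        print("executed", act)

threads = [threading.Thread(target=f) for f in (observe, generate, execute)]
for th in threads: th.start()
for th in threads: th.join()
```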
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
By iteratively reasoning over video snippets with a Chain-of-Thought, $\text{R}^2$VLM achieves state-of-the-art long-horizon task progress estimation without needing to process entire videos at once.
Animate 3D characters using bananas and plush toys – DancingBox turns everyday objects into motion capture proxies, making animation accessible to novices.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
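A hedged sketch of the draft-then-verify loop, with both models stubbed out (the candidate count, scoring rule, and action shapes are illustrative assumptions, not the paper's architecture):

```python
# Draft-then-verify: a fast "drafter" proposes candidate action chunks,
# a "verifier" scores each one, and the best chunk is executed.
import numpy as np

rng = np.random.default_rng(0)

def draft_actions(obs, num_candidates=4, horizon=8, dim=7):
    # Stand-in for a diffusion policy: sample candidate action chunks.
    return rng.normal(size=(num_candidates, horizon, dim))

def verifier_score(obs, chunk):
    # Stand-in for an autoregressive model's log-likelihood of the chunk.
    return -np.sum(chunk ** 2)    # toy score: prefers small, smooth actions

def act(obs):
    candidates = draft_actions(obs)
    scores = [verifier_score(obs, c) for c in candidates]
    return candidates[int(np.argmax(scores))]

chunk = act(obs=None)
print(chunk.shape)                # (8, 7): one verified action chunk
```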
A new RGB-T dataset and frequency-aware network expose the surprising limitations of existing UAV detectors when faced with real-world camouflage and complex backgrounds.
A new mixed reality testbed lets you plug real human drivers into a CAV simulation, offering unprecedented realism for testing autonomous vehicle interactions.
Human unpredictability is now a feature, not a bug: a mixed-reality testing framework leverages human interaction to generate high-quality corner cases for vehicle-infrastructure cooperation systems.
By treating camera pose as a unifying geometric representation, WorldCam achieves significantly improved action controllability and long-horizon 3D consistency in interactive gaming world models compared to prior video diffusion transformer approaches.
By aligning image and LiDAR features to event-derived spatiotemporal edges, $x^2$-Fusion achieves state-of-the-art accuracy in optical and scene flow estimation, particularly under challenging conditions where other multimodal fusion methods falter.
Achieve real-time cattle mounting pose estimation in complex environments with FSMC-Pose, a framework that outperforms existing methods while drastically reducing computational costs.
Stop averaging over noisy robot data: PTR selectively trusts training samples based on how well their post-action consequences align with learned representations, leading to more robust offline policy learning.
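One plausible reading of that selective-trust idea, sketched with stand-in encoder and dynamics models (my reading of the one-line summary, not PTR's published algorithm): each transition is weighted by how well its observed consequence matches a learned prediction.

```python
# Consequence-based sample weighting: low prediction error -> high trust.
import numpy as np

rng = np.random.default_rng(0)

def embed(state):                 # stand-in for a learned state encoder
    return np.tanh(state)

def predict_next(state, action):  # stand-in for a learned latent dynamics model
    return embed(state) + 0.1 * action

def trust_weight(s, a, s_next, temperature=1.0):
    err = np.linalg.norm(predict_next(s, a) - embed(s_next))
    return np.exp(-err / temperature)

# Weighted behavior-cloning loss over a toy batch of transitions.
batch = [(rng.normal(size=3), rng.normal(size=3), rng.normal(size=3))
         for _ in range(16)]
weights = np.array([trust_weight(s, a, sn) for s, a, sn in batch])
per_sample_loss = rng.random(16)  # stand-in for the per-sample BC loss
loss = np.sum(weights * per_sample_loss) / np.sum(weights)
print(round(float(loss), 4))
```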
Forget painstakingly creating 3D assets for robot training – ManiTwin automates the process, turning single images into simulation-ready objects at scale.
DriveFix tackles the "shaky camera" problem in 4D driving scene reconstruction, producing significantly more stable and coherent novel views by explicitly modeling spatio-temporal dependencies.
Ditch expensive, rendering-based RL for autonomous driving: PerlAD uses offline data to train agents in a fast, vector-space pseudo-simulation, outperforming prior methods by 10% on driving score.
Autonomous driving models can learn to avoid accidents *before* they happen by training on expert interventions and anticipating errors.
Forget training separate policies for every robot hand – this method learns one policy to control them all, slashing data needs and boosting performance by 50% in cross-embodiment manipulation.
Forget retraining your agent: Steve-Evolving distills execution failures into executable guardrails and successes into reusable skills, injecting them into an LLM planner for continual, parameter-free improvement.
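The distill-and-inject pattern is easy to see in miniature; the memory layout and prompt format below are assumptions for illustration, not Steve-Evolving's actual design:

```python
# Failures become guardrail rules, successes become reusable skills, and
# both are injected into the planner prompt; no model weights change.
guardrails, skills = [], {}

def record_episode(task, plan, succeeded, failure_reason=None):
    if succeeded:
        skills[task] = plan               # reusable skill: task -> known-good plan
    else:
        guardrails.append(f"When attempting '{task}', avoid: {failure_reason}")

def build_planner_prompt(task):
    lines = ["You are a task planner."]
    lines += [f"Rule: {g}" for g in guardrails]
    if task in skills:
        lines.append(f"Known-good plan for '{task}': {skills[task]}")
    lines.append(f"Task: {task}")
    return "\n".join(lines)

record_episode("mine iron", ["craft wooden pickaxe", "dig"], False,
               failure_reason="digging without a stone pickaxe")
record_episode("chop tree", ["walk to tree", "attack"], True)
print(build_planner_prompt("mine iron"))
```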
Current embodied AI agents falter when faced with the multi-floor complexity of environments built by MANSION, a new language-driven framework for generating realistic, building-scale 3D scenes.
Achieve 92% accuracy in identifying who's commanding a robot from 34 meters away by fusing IMU and camera data, a 48% leap over prior art.
Forget training separate models for different field-of-views in geo-localization — SinGeo achieves SOTA robustness with a single model, even outperforming specialized architectures.
Achieve safer and more effective human-robot collaboration by decoupling task execution from human interaction using a redundant robot's null space.
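The underlying mechanism is classical redundancy resolution; here is a worked sketch with the standard null-space projector $N = I - J^{+}J$ (shown for intuition, the paper's controller may differ): task velocities go through the Jacobian pseudoinverse, while interaction motions are projected into the null space so they cannot disturb the end-effector task.

```python
# Null-space decoupling for a redundant arm (standard resolved-rate control).
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 7))          # 3-DoF task, 7-DoF arm -> 4-dim null space
J_pinv = np.linalg.pinv(J)
N = np.eye(7) - J_pinv @ J           # null-space projector

xdot_task = np.array([0.1, 0.0, 0.0])    # desired end-effector velocity
qdot_interact = rng.normal(size=7)       # joint motion for human interaction

qdot = J_pinv @ xdot_task + N @ qdot_interact
print(np.allclose(J @ qdot, xdot_task)) # True: interaction leaves the task intact
```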
Fisheye cameras can now see the world in 4D, thanks to a new benchmark and method that tackles the unique distortions of spherical projection for improved occupancy tracking.
RAMBO's instability got you down? ROMI offers a robust, value-aware model learning approach with implicitly differentiable adaptive weighting that outperforms RAMBO and other SOTA methods in offline RL benchmarks.
Stop predicting the future, start predicting *change*: $\Delta$VLA guides robotic action by modeling how world knowledge *varies* under actions, not by forecasting absolute future states.
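A toy illustration of delta prediction (a tiny linear model, purely illustrative of the summary above; $\Delta$VLA's actual model is a VLA, not this): the network outputs a change $\Delta z = f(z, a)$, and rollouts accumulate changes rather than regenerating absolute states.

```python
# Rollout by accumulating predicted changes instead of absolute states.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 + 2, 4))   # tiny linear "delta model"

def predict_delta(z, a):
    return np.concatenate([z, a]) @ W        # delta_z = f(z, a)

z = rng.normal(size=4)                        # current latent world state
for _ in range(3):
    a = rng.normal(size=2)
    z = z + predict_delta(z, a)               # next state = state + predicted change
print(z)
```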
Achieve sub-millimeter accuracy in 3D reconstruction of flexible continuum robots by enforcing global biplanar geometric consistency, even with noisy or occluded images.
LLMs can now parallel park your car: U-Parking uses them for intelligent planning in a distributed UWB-assisted autonomous system.
Ditch test-time optimization: MoRe achieves real-time 4D scene reconstruction from monocular video using a feedforward transformer that disentangles motion and structure.
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
By predicting latent features instead of pixels, PROSPECT achieves state-of-the-art VLN performance and long-horizon robustness without adding inference overhead.
Achieve state-of-the-art semantic scene understanding from sparse views with a feed-forward architecture that generalizes across diverse environments.
A novel 2-DoF crank-slider mechanism lets a wire-driven robotic fish swim fast *and* turn sharply, breaking the trade-off between speed and maneuverability.
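For reference, the classic 1-DoF crank-slider relation the 2-DoF design builds on is $x(\theta) = r\cos\theta + \sqrt{l^2 - r^2\sin^2\theta}$ for crank radius $r$ and connecting-rod length $l$; the sketch below shows only this textbook kinematics, not the paper's mechanism.

```python
# Standard crank-slider kinematics: the slider (tail linkage) oscillates
# as the crank rotates through one revolution.
import math

def slider_position(theta, r=0.02, l=0.08):
    return r * math.cos(theta) + math.sqrt(l**2 - (r * math.sin(theta))**2)

for deg in range(0, 361, 90):
    print(deg, round(slider_position(math.radians(deg)), 4))
```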
By disentangling structure and motion in the latent space, CoWVLA achieves superior visuomotor learning compared to standard world-model and latent-action approaches.
Diffusion planners get a boost in robustness and performance thanks to SAGE, a self-supervised method that weeds out dynamically inconsistent plans using a learned latent consistency signal.
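A hedged sketch of consistency-based plan filtering (the dynamics model and scoring rule below are stand-ins, not SAGE's learned signal): a one-step latent dynamics model replays each candidate plan, and plans whose states drift from the model's predictions are discarded.

```python
# Score diffusion-sampled plans by latent dynamics consistency, keep the best.
import numpy as np

rng = np.random.default_rng(0)

def latent_dynamics(z, a):         # stand-in for a learned dynamics model
    return 0.9 * z + 0.1 * a

def consistency_score(plan_states, plan_actions):
    errs = [np.linalg.norm(latent_dynamics(z, a) - z_next)
            for z, a, z_next in
            zip(plan_states[:-1], plan_actions, plan_states[1:])]
    return -float(np.mean(errs))   # higher = more dynamically consistent

plans = [(rng.normal(size=(6, 4)), rng.normal(size=(5, 4))) for _ in range(8)]
scores = [consistency_score(s, a) for s, a in plans]
keep = np.argsort(scores)[len(plans) // 2:]   # keep the top half
print("kept plans:", sorted(int(i) for i in keep))
```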
Robots that learn from their mistakes *while* navigating? SERP unlocks this by evolving the action model in-context during replanning, boosting success rates and cutting token costs.
Achieve up to 139x speedup in robust trajectory optimization by exploiting GPU parallelism with custom CUDA kernels and novel optimization architectures.
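The structural idea is batched cost evaluation; here is a vectorized NumPy stand-in (the paper uses custom CUDA kernels, and this CPU sketch only mirrors the shape of the computation): many perturbed trajectories are costed in one fused batched operation rather than a per-candidate loop, which is exactly the structure a GPU kernel exploits.

```python
# Evaluate the cost of many candidate trajectories in one batched operation.
import numpy as np

rng = np.random.default_rng(0)
B, H, D = 1024, 50, 2                         # candidates, horizon, state dim
nominal = np.linspace(0, 1, H)[None, :, None] * np.ones((B, H, D))
candidates = nominal + 0.1 * rng.normal(size=(B, H, D))   # perturbed rollouts

goal = np.ones(D)
smoothness = np.sum(np.diff(candidates, axis=1) ** 2, axis=(1, 2))
terminal = np.sum((candidates[:, -1] - goal) ** 2, axis=1)
costs = smoothness + terminal                 # one fused, batched cost evaluation
print("best candidate:", int(costs.argmin()))
```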
Achieve state-of-the-art monocular re-localization in OpenStreetMap by cleverly aligning image semantics with map data, enabling faster and more accurate localization than dense matching approaches.
Achieve real-time, drift-free online 3D reconstruction by decoupling memory into actively refreshed local geometry and a stable, persistent global structure.
Achieve more realistic and physically plausible scene reconstructions from video by explicitly optimizing viewpoints for object generation and synthesizing scene graphs within a 3D simulator.
Achieve 100% success rates in visually ambiguous manipulation tasks by fusing high-frequency tactile data with low-frequency visual planning, outperforming visual-only baselines and satisfying hard real-time constraints.
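A minimal two-rate control loop in the spirit of that summary (the 10 Hz / 1 kHz split and the additive correction are assumptions, not the paper's controller): a slow visual planner refreshes the goal while a fast tactile loop corrects the command between planner updates.

```python
# Slow visual re-planning plus fast tactile refinement in one loop.
import numpy as np

rng = np.random.default_rng(0)
goal = np.zeros(3)

def visual_plan():                     # slow: runs every 100 ticks (~10 Hz)
    return rng.normal(size=3)

def tactile_correction():              # fast: runs every tick (~1 kHz)
    return 0.01 * rng.normal(size=3)   # small contact-driven adjustment

for tick in range(300):
    if tick % 100 == 0:
        goal = visual_plan()           # low-frequency visual re-planning
    command = goal + tactile_correction()  # high-frequency tactile refinement
    # send `command` to the robot here
print(command)
```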
Achieve dexterous hand retargeting that's both fast and generalizable by decomposing reinforcement learning policies into finger-specific modules coordinated by a residual network.
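A sketch of the modules-plus-residual decomposition (layer shapes and wiring are assumptions about the summary, not the paper's networks): each finger has its own small policy, and a residual term computed over all finger outputs adds whole-hand coordination.

```python
# Per-finger modules plus a residual coordination term.
import numpy as np

rng = np.random.default_rng(0)
FINGERS, OBS, JOINTS = 5, 16, 4

finger_W = [rng.normal(scale=0.1, size=(OBS, JOINTS)) for _ in range(FINGERS)]
coord_W = rng.normal(scale=0.01, size=(FINGERS * JOINTS, FINGERS * JOINTS))

def retarget(obs):
    per_finger = np.concatenate([obs @ W for W in finger_W])  # module outputs
    return per_finger + per_finger @ coord_W                  # + residual coordination

cmd = retarget(rng.normal(size=OBS))
print(cmd.shape)    # (20,): joint targets for 5 fingers x 4 joints
```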
LLMs can now handle autonomous driving tasks with greater precision and efficiency thanks to DriveCode, which replaces discrete number tokens with continuous embeddings.
Achieve both long-term scene consistency and precise camera control in world models with UCM, a novel framework sidestepping explicit 3D reconstruction.
A novel asymmetric V2X communication strategy allows even non-connected vehicles to benefit from occlusion risk mitigation, outperforming traditional symmetric models even at significantly lower penetration rates.
Forget static cues: VAGNet grounds 3D object affordances by watching how humans actually use them in videos, significantly improving localization of interaction regions.
LLMs can now actively perceive and react to anomalies during scientific simulations, leading to more reliable and accurate results in complex engineering and modeling tasks.
Current VLM-driven embodied agents struggle with fundamental skills like navigation and object manipulation when evaluated in realistic, low-level action spaces, severely hindering their performance on complex tasks.
Ditch the discrete anchors: MeanFuser achieves state-of-the-art autonomous driving trajectory generation by using a continuous Gaussian Mixture Noise representation and a mean-flow formulation for faster, more robust planning.
Cycle consistency unlocks SOTA cross-view object correspondence in videos without ground-truth annotations, even enabling test-time training.
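Cycle consistency is simple to demonstrate: with toy features and nearest-neighbour matching (a stand-in for the learned matcher), a match from view A to view B is trusted only if matching back from B returns to the starting point, which needs no ground-truth labels.

```python
# Cycle-consistency check for cross-view correspondence on toy features.
import numpy as np

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(10, 8))            # per-object features, view A
perm = rng.permutation(10)
feats_b = feats_a[perm] + 0.05 * rng.normal(size=(10, 8))  # view B, shuffled + noise

def nn_match(src, dst):
    d = np.linalg.norm(src[:, None] - dst[None, :], axis=-1)
    return d.argmin(axis=1)                   # index of nearest neighbour in dst

ab = nn_match(feats_a, feats_b)               # A -> B
ba = nn_match(feats_b, feats_a)               # B -> A
consistent = ba[ab] == np.arange(10)          # cycle A -> B -> A returns home
print(consistent.mean())                      # fraction of cycle-consistent matches
```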
By aligning latent representations with multiple visual foundation models, FRAPPE offers a more scalable and data-efficient way to imbue generalist robotic policies with robust world-awareness.