Robotics & Embodied AI
Applications: Robot learning, embodied agents, manipulation, locomotion, and sim-to-real transfer with foundation models.
Recent Papers
The paper introduces Modular Residual Reinforcement Learning (MoReL), a novel RL framework for dexterous hand retargeting that decomposes policy learning into finger-specific subpolicies and a residual coordination module. This decomposition enables efficient training from minimal demonstrations, low-latency inference, and flexible input modalities, addressing limitations of optimization-based and learning-based methods. Experiments demonstrate MoReL's superior performance and cross-platform adaptability in fine-grained dexterous manipulation tasks, validating the effectiveness of the architecture and reward design.
Introduces a modular reinforcement learning framework that decomposes dexterous hand retargeting into finger-specific subpolicies and a residual coordination module to improve generalization and reduce training data requirements.
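The decomposition MoReL describes can be sketched in a few lines. Everything here is illustrative — the function names, the per-finger action shapes, and the residual scale are assumptions, not details from the paper:

```python
import numpy as np

def morel_action(obs, finger_policies, coordination, residual_scale=0.1):
    """Compose per-finger subpolicy actions with a scaled residual correction."""
    # Each subpolicy maps the shared observation to its own finger's targets.
    base = np.concatenate([pi(obs) for pi in finger_policies])
    # A small full-hand residual from the coordination module couples the fingers.
    return base + residual_scale * coordination(obs)
```

Keeping the residual small preserves the per-finger structure while still letting the coordination module correct inter-finger interactions.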
This paper introduces SMAPPO, a scalable multi-agent reinforcement learning framework for decentralized multi-robot management in multi-machine tending scenarios. SMAPPO employs a novel observation encoder to achieve input-size invariance, enabling it to handle varying numbers of agents, machines, and storage areas without retraining. Experiments demonstrate that SMAPPO outperforms MAPPO in full retraining, curriculum learning, zero-shot generalization, and adaptability under low initial training, showing significant improvements in productivity, collision avoidance, and parts delivery.
Introduces a novel observation encoder for MAPPO that enables zero-shot generalization to variable numbers of agents and machines in multi-agent reinforcement learning.
This paper introduces a multi-degree-of-freedom reinforcement learning framework for robotic 3D measurement, enabling continuous viewpoint planning to improve the reconstruction of complex geometries. The framework uses a voxel-based state representation with dynamic ray-traced coverage updates and a dual-objective reward function to balance overlap control and viewpoint minimization. Experimental results on industrial parts show the proposed method achieves superior overlap regulation and planning efficiency compared to existing techniques, leading to more accurate 3D reconstructions.
Introduces a novel multi-DoF reinforcement learning framework for robotic 3D measurement that optimizes viewpoint planning by dynamically balancing coverage, overlap, and robotic kinematics.
This paper addresses the challenge of distributional mismatch in offline RL when transferring policies learned from hybrid (real and simulated) datasets to the real world. They propose using Progressive Neural Networks (PNNs) to transfer the offline policy, leveraging the hybrid dataset for faster learning and improved real-world adaptation. Experiments on robotic manipulation tasks demonstrate that PNNs effectively retain the learned policy, bridge the sim-to-real gap, and enable more diverse exploration during online fine-tuning.
Introduces a PNN-based transfer learning approach to mitigate distributional shift and improve real-world adaptation in offline RL using hybrid datasets.
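The PNN mechanism the transfer relies on freezes the source column and feeds it into a new column through lateral connections. A toy single-layer sketch (shapes and the tanh nonlinearity are illustrative):

```python
import numpy as np

def pnn_layer(x, W_frozen, W_new, U_lateral):
    """One layer of a two-column progressive network.

    The source column (W_frozen) stays fixed, so the offline policy is
    retained; the new column receives its features via a lateral connection
    and adapts online without overwriting them.
    """
    h_src = np.tanh(W_frozen @ x)                  # frozen: never updated online
    h_new = np.tanh(W_new @ x + U_lateral @ h_src)  # trainable column
    return h_src, h_new
```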
This paper introduces General Utility Markov Games (GUMGs), an extension of Convex Markov Games (cMGs) that allows for coupling between agents' occupancy measures, and proves that Nash equilibria in GUMGs coincide with fixed points of projected pseudo-gradient dynamics due to a novel agent-wise gradient domination property. Leveraging this characterization, the authors provide a simplified proof of Nash equilibrium existence, demonstrate the existence of Markov perfect equilibria, and derive a policy gradient theorem for GUMGs. Furthermore, they establish iteration and sample complexity guarantees for computing approximate-NE in potential GUMGs using policy gradient methods.
Establishes a novel agent-wise gradient domination property in General Utility Markov Games (GUMGs), enabling a characterization of Nash equilibria as fixed points of projected pseudo-gradient dynamics and facilitating the design and analysis of policy gradient algorithms.
The paper introduces Agent-guided Policy Search (AGPS), a novel reinforcement learning framework that replaces human supervisors with a multimodal agent to improve sample efficiency in robotic manipulation tasks. AGPS leverages the agent as a semantic world model, using executable tools to provide corrective waypoints and spatial constraints for exploration. Experiments on precision insertion and deformable object manipulation tasks demonstrate that AGPS outperforms Human-in-the-Loop methods, achieving better sample efficiency by automating the supervision pipeline.
Introduces Agent-guided Policy Search (AGPS), a framework that automates robot reinforcement learning by using a multimodal agent to provide corrective guidance, thereby improving sample efficiency and scalability compared to human-in-the-loop methods.
The paper introduces DynaHOI-Gym, a new online closed-loop platform for benchmarking hand motion generation in dynamic hand-object interaction (HOI) scenarios, addressing the limitations of existing benchmarks focused on static objects. To facilitate research, the authors release DynaHOI-10M, a large-scale dataset comprising 10 million frames and 180K hand capture trajectories with diverse target motions. They also present an observe-before-act (ObAct) baseline that leverages spatiotemporal attention, demonstrating improved location success rates in the dynamic HOI setting.
Introduces DynaHOI-Gym and DynaHOI-10M, a novel benchmark and dataset for evaluating hand motion generation in dynamic hand-object interaction scenarios.
The paper addresses the challenge of sparse rewards in Reinforcement Learning for GUI agents by introducing Adaptive Milestone Reward (ADMIRE), a mechanism that dynamically distills milestones from successful explorations to provide verifiable, adaptive rewards. ADMIRE employs an asymmetric credit assignment strategy to denoise successful trajectories and scaffold failed ones, effectively balancing reward fidelity and density. Experiments on AndroidWorld demonstrate over 10% improvement in success rate across different base models, with strong generalizability observed in web navigation and embodied tasks.
Introduces ADMIRE, an adaptive milestone reward mechanism with asymmetric credit assignment, to improve temporal credit assignment in long-horizon GUI agent tasks.
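One way to read the milestone mechanism: once milestones are distilled from successful explorations, a trajectory earns dense reward for each milestone it satisfies in order. The predicate representation and ordered-matching rule below are an assumption for illustration, not ADMIRE's exact formulation:

```python
def milestone_reward(trajectory, milestones):
    """Dense reward: fraction of distilled milestones hit, in order."""
    i = 0
    for step in trajectory:
        # Credit the first step that satisfies the next pending milestone.
        if i < len(milestones) and milestones[i](step):
            i += 1
    return i / len(milestones)
```

A failed trajectory that still reaches early milestones gets partial credit, which is the scaffolding effect the summary describes.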
This paper introduces a novel control framework that combines conformal prediction (CP) and system level synthesis (SLS) to achieve robust out-of-distribution (OOD) planning and control with learned dynamics models. The method uses weighted CP with a learned covariance model to derive high-confidence model error bounds, which are then incorporated into an SLS-based robust nonlinear MPC formulation with volume-optimized reachable sets for constraint tightening. Empirical results on nonlinear systems like a 4D car and a 12D quadcopter demonstrate improved safety and robustness, particularly in OOD scenarios, compared to baselines.
Integrates conformal prediction with system level synthesis to create a robust MPC framework that provides safety guarantees for out-of-distribution planning and control using learned dynamics models.
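The conformal step behind such error bounds is standard: take held-out model-error residuals and return the conformal quantile, which covers a fresh error with probability at least 1 − α under exchangeability. The paper's weighted variant additionally reweights residuals via a learned covariance model; this sketch shows only the unweighted case:

```python
import numpy as np

def conformal_error_bound(calib_residuals, alpha=0.1):
    """Finite-sample (1 - alpha) bound on model error from held-out residuals."""
    n = len(calib_residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # conformal quantile index
    return np.sort(calib_residuals)[min(k, n) - 1]
```

The resulting scalar bound is what gets propagated into the MPC constraint tightening.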
The paper introduces Incremental Signature Contribution (ISC), a method that decomposes truncated path signatures into a temporally ordered sequence of elements, preserving the algebraic structure and expressivity of signatures while exposing temporal evolution. This allows for processing signature-based representations using sequential models, addressing the limitation of standard path signatures which collapse temporal structure. The authors then introduce ISC-Transformer (ISCT), an offline RL model integrating ISC into a standard Transformer, and demonstrate its effectiveness on benchmark tasks, particularly in settings requiring temporal sensitivity.
Introduces Incremental Signature Contribution (ISC), a novel method to decompose path signatures into temporally ordered sequences for improved temporal sensitivity in sequential modeling tasks.
This paper introduces a geometric model for optimal locomotion of slender bodies based on sub-Riemannian geodesics, accounting for both environmental displacement and internal shape-change energy dissipation. They formulate Lagrangian least-dissipation principles as boundary value problems and solve them numerically using a consistent time and space discretization for various boundary conditions. The resulting optimal gaits match observed biological motion and provide insights into locomotion mechanisms, particularly for generalized Purcell's swimmers.
Introduces a novel geometric model for optimal locomotion that accounts for both environmental displacement and internal shape-change energy dissipation, enabling the computation of optimal gaits for slender bodies.
This paper introduces HyperDet, a radar-only 3D object detection framework that enhances raw radar data to be more compatible with LiDAR-oriented detectors. HyperDet aggregates multi-frame, multi-radar data, applies geometry-aware cross-sensor validation, and uses a foreground-focused diffusion module trained with mixed radar-LiDAR supervision to densify object structures and lift radar attributes. Experiments on the MAN TruckScenes dataset demonstrate that HyperDet improves performance with VoxelNeXt and CenterPoint, reducing the gap between radar-only and LiDAR-based detection.
Proposes HyperDet, a novel radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud to improve performance with standard LiDAR-oriented detectors.
The paper introduces Affordance-Graphed Task Worlds (AGT-World), a framework that automatically generates interactive simulated environments and robot task policies from real-world observations by formalizing the task space as a structured graph. This graph-based approach allows for hierarchical decomposition of complex goals into atomic primitives, addressing the limitations of random proposal or static replication methods. The authors further incorporate a self-evolution mechanism with hybrid feedback, combining Vision-Language Model reasoning and geometric verification, to refine policies.
Introduces a self-evolving framework for generating simulated task environments and robot policies by structuring the task space as an affordance graph and using hybrid feedback for policy refinement.
This paper introduces FAST, a humanoid whole-body control framework designed for fast adaptation and stable motion tracking. FAST employs Parseval-Guided Residual Policy Adaptation, learning a lightweight delta action policy with orthogonality and KL constraints for efficient adaptation to new motions. The framework also incorporates Center-of-Mass-Aware Control, enhancing balance by integrating CoM-related observations and objectives.
Introduces Parseval-Guided Residual Policy Adaptation, a novel method for efficiently adapting humanoid control policies to new motions by learning a lightweight delta action policy under orthogonality and KL constraints.
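Parseval-style constraints are typically enforced with an orthogonality regularizer on the weight matrices; a generic sketch (not the paper's exact loss) is:

```python
import numpy as np

def parseval_penalty(W):
    """Penalize deviation of W's rows from an orthonormal set (W W^T = I).

    Driving the Gram matrix toward identity keeps the layer近 norm-preserving,
    which stabilizes the residual (delta) policy during adaptation.
    """
    gram = W @ W.T
    return float(np.sum((gram - np.eye(W.shape[0])) ** 2))
```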
This paper introduces a task planning framework that integrates Learning-Informed Object Search (LIOS) actions into high-level planning to address scenarios with missing objects. The framework models LIOS actions as deterministic, leveraging model-based calculations to estimate their cost and interleave search and execution steps. The approach demonstrates effective task planning with uncertainty, outperforming both non-learned and learned baselines in simulated ProcTHOR environments and real-world experiments involving retrieval and meal preparation tasks.
Introduces a novel planning framework that integrates learning-informed object search (LIOS) actions into task planning, enabling effective handling of missing objects by interleaving search and execution.
This paper introduces a decentralized multi-robot system for detecting and tracking floating containers in maritime environments, using a team of UAVs and an autonomous surface vessel. The system employs YOLOv8 and stereo disparity for visual detection on each UAV, followed by per-object Extended Kalman Filters (EKFs) for tracking with uncertainty-aware data association. Track summaries are exchanged and fused using covariance intersection to maintain consistency, and an information-driven assignment module optimizes target allocation and UAV viewpoints.
Introduces a decentralized multi-robot perception framework that combines visual detection, EKF tracking with uncertainty-aware data association, conservative track fusion via covariance intersection, and information-driven task assignment for robust maritime object tracking.
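Covariance intersection is the standard conservative fusion rule when the cross-correlation between two track estimates is unknown. In practice ω is often chosen to minimize the fused trace or determinant; a fixed ω is shown:

```python
import numpy as np

def covariance_intersection(mu_a, P_a, mu_b, P_b, omega=0.5):
    """Fuse two estimates whose cross-correlation is unknown but bounded."""
    Pa_inv, Pb_inv = np.linalg.inv(P_a), np.linalg.inv(P_b)
    # Convex combination in information (inverse-covariance) space.
    P = np.linalg.inv(omega * Pa_inv + (1 - omega) * Pb_inv)
    mu = P @ (omega * Pa_inv @ mu_a + (1 - omega) * Pb_inv @ mu_b)
    return mu, P
```

Unlike a naive Kalman update, the fused covariance never becomes overconfident when the two tracks share information, which is why it keeps decentralized fusion consistent.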
This paper introduces a deep learning approach to enhance social robot gaze behavior by incorporating both human and non-human stimuli, using LSTM and Transformer models trained on human gaze data collected via VR in simulated and real-world scenarios. The models predict human gaze direction with accuracies up to 72% and 71.6% for LSTM and Transformer respectively in real-world settings, outperforming existing methods by uniquely considering non-human stimuli. The system was deployed on a NAO robot and evaluated with 275 participants, demonstrating high user satisfaction.
Demonstrates a novel approach to predicting human gaze in social settings by integrating non-human stimuli and achieving state-of-the-art accuracy using LSTM and Transformer models.
This paper introduces ViTaS, a visuomotor learning framework that leverages both visual and tactile information through Soft Fusion Contrastive Learning and a CVAE module to improve performance in manipulation tasks, especially in occluded scenarios. The Soft Fusion Contrastive Learning method is designed to better exploit the alignment and complementarity of visual and tactile representations. Experiments across 12 simulated and 3 real-world environments demonstrate that ViTaS significantly outperforms existing baselines, highlighting the benefits of the proposed fusion and contrastive learning approach.
Introduces Soft Fusion Contrastive Learning to effectively fuse visual and tactile information for visuomotor tasks, improving performance in occluded scenarios by explicitly modeling the complementary nature of the two modalities.
This paper analyzes the potential of 6G networks to enhance robotic systems by mapping IMT-2030 key performance indicators to robotic functional blocks like sensing, perception, and actuation. It argues that 6G's enhanced capabilities are crucial for enabling more complex and autonomous robotic systems. The paper proposes a high-level architectural framework integrating robotic, intelligent, and network service planes and demonstrates a real-time safety framework for human-robot collaboration as a use case.
Proposes a high-level architectural framework integrating robotic, intelligent, and network service planes to leverage 6G capabilities for advanced robotics.
This paper introduces VLAW, an iterative algorithm for co-improving vision-language-action (VLA) policies and action-conditioned video generation world models using real-world rollouts. VLAW leverages real-world data to refine the world model, which is then used to generate synthetic data for further policy improvement, addressing the limitations of world models trained solely on demonstration datasets. Experiments on a real robot demonstrate a 39.2% absolute improvement in success rate over the base policy, highlighting the effectiveness of the iterative co-improvement strategy.
Introduces an iterative co-improvement algorithm, VLAW, that refines both a vision-language-action policy and an action-conditioned video generation world model through interleaved real-world data collection and synthetic data generation.
The paper introduces PathCRF, a novel framework for detecting on-ball soccer events using only player tracking data by inferring possession paths. They model player trajectories as a fully connected dynamic graph and use a Conditional Random Field (CRF) to ensure logical consistency in the inferred possession sequence. Experiments demonstrate that PathCRF accurately detects possession paths and events, reducing the need for manual annotation.
Introduces a ball-free soccer event detection framework, PathCRF, that infers possession paths from player trajectories using a CRF to enforce logical consistency.
This paper explores test-time verification as a method to improve vision-language-action (VLA) alignment, addressing the "intention-action gap" in embodied instruction following. They demonstrate that scaling both rephrased instructions and generated actions at test time enhances sample diversity and improves action selection. The authors introduce CoVer, a contrastive verifier, and a hierarchical verification inference pipeline, showing that this verification approach outperforms scaling policy pre-training on the SIMPLER and PolaRiS benchmarks.
Demonstrates that scaling test-time verification, through diverse instruction rephrasing and action candidate generation, is more effective than scaling policy pre-training for vision-language-action alignment.
The paper introduces 3DGSNav, a zero-shot object navigation (ZSON) framework that leverages 3D Gaussian Splatting (3DGS) as persistent memory for vision-language models (VLMs) to improve spatial reasoning. 3DGSNav actively constructs a 3DGS representation of the environment and uses trajectory-guided free-viewpoint rendering to generate frontier-aware first-person views, which are then combined with structured visual prompts and Chain-of-Thought prompting to enhance VLM reasoning. Experiments on multiple benchmarks and a quadruped robot show that 3DGSNav achieves competitive performance compared to existing methods.
Introduces a novel zero-shot object navigation framework that integrates 3D Gaussian Splatting as persistent memory for vision-language models, enabling trajectory-guided free-viewpoint rendering and enhanced spatial reasoning.
This paper proposes a Unified Smart Safety and Security Architecture for AI-driven mining environments, addressing challenges like poor illumination, GPS denial, and cyber-physical threats. The architecture integrates multimodal perception, secure federated learning, reinforcement learning, DTN communication, and energy-aware sensing to improve safety and security. The proposed system incorporates five core modules for miner localization, hazard understanding, federated robustness, and predictive maintenance.
Envisions and outlines a comprehensive architecture integrating diverse AI and security techniques to enhance safety and security in autonomous mining environments.
The paper introduces DTAPP-IICR, a Delivery-Time Aware Prioritized Planning method with Incremental and Iterative Conflict Resolution, for preflight planning of large UAV fleets in dynamic airspaces with temporal No-Fly Zones and heterogeneous vehicle profiles. DTAPP-IICR uses a novel 4D single-agent planner (SFIPP-ST) to generate roundtrip trajectories while enforcing temporal NFZs and modeling inter-agent conflicts as soft constraints, followed by a Large Neighborhood Search guided by a geometric conflict graph. Experiments on benchmarks with up to 1,000 UAVs demonstrate near-100% success and up to 50% runtime reduction compared to batch Enhanced Conflict-Based Search, showcasing its scalability and practicality for dense urban airspace.
Introduces DTAPP-IICR, a scalable and practical preflight planning method for large UAV fleets that integrates delivery-time awareness, prioritized planning, and iterative conflict resolution within dynamic airspaces.
The paper identifies limitations in current Vision-Language-Action (VLA) models stemming from inadequate visual representations learned through language-image contrastive learning or image-based self-supervised learning. It proposes JEPA-VLA, a method that integrates video predictive embeddings (specifically V-JEPA 2) into VLAs to improve environment understanding and policy priors. Experiments on benchmarks like LIBERO and real-robot tasks demonstrate that JEPA-VLA significantly improves performance by leveraging the ability of video predictive embeddings to encode task-relevant temporal dynamics.
Introduces JEPA-VLA, a novel approach that adaptively integrates video predictive embeddings into existing VLAs to enhance environment understanding and policy priors.
This paper addresses the limited generalization of diffusion-based policies in semantic manipulation by introducing bounding-box instructions to guide the policy's attention to target objects. They develop Label-UMI, a handheld segmentation device with an automated annotation pipeline, to efficiently collect demonstration data with semantic labels. Through real-world experiments, the authors demonstrate improved generalization and adaptability using a semantic-motion-decoupled framework and reveal a power-law relationship between generalization performance and the number of bounding-box objects, achieving 85% success rates across various tasks.
Demonstrates that bounding-box guided diffusion policies, trained on large-scale datasets collected with a novel handheld segmentation device, significantly improve generalization in semantic manipulation tasks and exhibit a power-law scaling relationship.
This paper introduces neck-mounted egocentric gaze estimation and presents a new dataset of 4 hours of video from 8 participants performing daily activities. They evaluate a transformer-based gaze estimation model (GLC) and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach with a geometry-aware loss. The auxiliary classification task improves performance, while the co-learning approach does not.
Introduces a new task of neck-mounted egocentric gaze estimation and provides a corresponding dataset to facilitate research in this area.
The paper introduces LDA-1B, a robot foundation model that scales to 1B parameters by learning dynamics, policy, and visual forecasting from a new 30k-hour embodied interaction dataset (EI-30k) comprising diverse human and robot trajectories. LDA-1B leverages a structured DINO latent space for dynamics prediction to avoid pixel-space modeling and employs a multi-modal diffusion transformer to handle asynchronous vision and action streams. Experimental results demonstrate that LDA-1B outperforms existing methods on contact-rich, dexterous, and long-horizon tasks, while also enabling data-efficient fine-tuning by effectively utilizing low-quality trajectories.
Introduces a scalable robot foundation model, LDA-1B, capable of learning from diverse embodied data by predicting in a structured latent space and employing a multi-modal diffusion transformer.
The paper introduces HoloBrain-0, a Vision-Language-Action (VLA) framework designed to improve real-world robot deployment by incorporating robot embodiment priors like multi-view camera parameters and URDF into its architecture. They employ a "pre-train then post-train" paradigm, achieving SOTA results on simulation benchmarks and strong performance on real-world manipulation tasks, even with a small 0.2B-parameter variant. The authors open-source the entire HoloBrain ecosystem, including pre-trained models, post-trained checkpoints, and a full-stack VLA infrastructure called RoboOrchard, to facilitate research and adoption.
Introduces a novel VLA architecture, HoloBrain-0, that explicitly incorporates robot embodiment priors to enhance 3D spatial reasoning and improve performance in both simulation and real-world robotic manipulation tasks.
The paper introduces HAIC, a framework for humanoid robots to interact with underactuated objects having independent dynamics, addressing limitations of prior HOI methods focused on rigidly coupled objects. HAIC uses a dynamics predictor to estimate high-order object states from proprioceptive history, projecting these onto geometric priors to create a dynamic occupancy map for collision avoidance and contact affordance inference. Through asymmetric fine-tuning of a world model, HAIC achieves robust performance on agile manipulation tasks like skateboarding and cart pushing, as well as long-horizon multi-object tasks.
Introduces a dynamics predictor that estimates high-order object states from proprioceptive history and projects them onto geometric priors to create a dynamic occupancy map for robust humanoid-object interaction.
This paper introduces Counterfactual Conditional Likelihood (CCL) rewards to address redundant exploration in multiagent systems by scoring each agent's unique contribution to team exploration. CCL rewards agents for observations that are informative with respect to the joint exploration of the team, rather than solely for individual novelty. Experiments in continuous multiagent domains demonstrate that CCL accelerates learning in sparse reward environments requiring tight coordination.
Introduces Counterfactual Conditional Likelihood (CCL) rewards to incentivize efficient team exploration by rewarding agents based on their unique contribution to the team's joint exploration.
The paper introduces EasyMimic, a framework for imitation learning on low-cost robots using human video demonstrations. It extracts 3D hand trajectories from RGB videos, aligns them to the robot's gripper control space, and employs a hand visual augmentation strategy to bridge the human-to-robot domain gap. By co-training a model on processed human data and a small amount of robot data, EasyMimic achieves high performance on manipulation tasks with the LeRobot platform, reducing the need for extensive robot data collection.
Introduces a low-cost and replicable imitation learning framework, EasyMimic, that enables robots to learn manipulation policies from human video demonstrations using 3D hand trajectory extraction, action alignment, and co-training.
The paper introduces GigaBrain-0.5M*, a vision-language-action (VLA) model trained using world model-based reinforcement learning to improve multi-step action prediction. They leverage the spatiotemporal reasoning capabilities of video world models pre-trained on large video datasets to enhance VLA learning. By integrating world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy), GigaBrain-0.5M* achieves significant performance gains (approximately 30%) over the RECAP baseline on complex manipulation tasks and demonstrates reliable long-horizon execution in real-world deployments.
Demonstrates that integrating world model-based reinforcement learning via RAMP into a VLA model significantly improves performance and long-horizon execution on complex manipulation tasks.
This paper introduces a method for learning structured latent representations in RL where distances reflect transition costs, providing a geometric interpretation of uncertainty without explicit probabilistic modeling. They achieve this with a multimodal latent transition model and inverse distance weighting for sensor fusion, enabling adaptive integration of multiple sensor modalities. Empirical validation on multimodal RL tasks demonstrates improved robustness to sensor noise, superior state estimation, and enhanced RL agent performance compared to baselines, eliminating the need for noise augmentation.
Introduces a novel metric space formulation for state estimation in RL that learns a transition-aware latent representation, enabling a geometric interpretation of uncertainty and adaptive sensor fusion.
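Inverse-distance weighting over per-modality latents might look like the sketch below; the "anchor" here is taken to be the transition model's predicted latent, and the details are assumptions for illustration:

```python
import numpy as np

def idw_fuse(latents, predicted, eps=1e-6):
    """Weight each sensor's latent by inverse distance to the predicted state.

    A modality whose encoding strays far from the transition model's
    prediction (e.g. due to sensor noise) is automatically down-weighted,
    without any explicit probabilistic noise model.
    """
    dists = np.array([np.linalg.norm(z - predicted) for z in latents])
    w = 1.0 / (dists + eps)
    w /= w.sum()
    return np.sum([wi * z for wi, z in zip(w, latents)], axis=0)
```

Because distances in the learned space reflect transition costs, a large distance doubles as a geometric uncertainty signal, which is the paper's central idea.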
This paper introduces Adaptive-RF Transmission (ART), a communication-aware planning algorithm for multi-agent robotic exploration that modulates transmission location based on signal strength and data payload size. ART aims to improve coordination and efficiency in communication-limited environments by enabling heterogeneous robot teams to share information without excessive backtracking. Simulation results across cave-inspired environments show that ART and its extension, ART-SST, outperform existing strategies, achieving significant reductions in distance traveled and exploration time.
Introduces a novel communication-aware planning algorithm, Adaptive-RF Transmission (ART), that dynamically adjusts transmission location based on signal strength and data payload size for efficient multi-agent robotic exploration.
This paper introduces Adaptive-Horizon Conflict-Based Search (ACCBS), a closed-loop multi-agent path finding algorithm that addresses the limitations of open-loop planners and closed-loop heuristics in MAPF. ACCBS employs a finite-horizon CBS variant with a horizon-changing mechanism inspired by iterative deepening MPC, dynamically adjusting the planning horizon based on computational budget. The algorithm reuses a single constraint tree to enable seamless transitions between horizons, achieving anytime behavior and asymptotic optimality.
Introduces ACCBS, a novel closed-loop MAPF algorithm that combines finite-horizon planning with dynamic horizon adjustment for improved robustness and performance guarantees.
This paper tackles cooperative path planning for heterogeneous UAV swarms using Multi-Agent Reinforcement Learning (MARL), focusing on asymmetric inter-agent dependencies and training instability. They introduce AC-MASAC, an attentive curriculum learning framework that incorporates a role-aware heterogeneous attention mechanism and a structured curriculum strategy with hierarchical knowledge transfer and stage-proportional experience replay. Empirical results on a custom simulation platform demonstrate AC-MASAC's superiority over existing methods in Success Rate, Formation Keeping Rate, and Success-weighted Mission Time.
Introduces an attentive curriculum learning framework (AC-MASAC) that explicitly models asymmetric inter-agent dependencies and mitigates sparse rewards and catastrophic forgetting in heterogeneous UAV swarm coordination.
The paper introduces Robot-DIFT, a framework that distills geometric priors from a frozen diffusion model into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN) to improve visuomotor control. This distillation process aims to address the structural mismatch between vision encoders optimized for semantic invariance and the geometric sensitivity required for precise manipulation. Robot-DIFT, pretrained on the DROID dataset, achieves superior geometric consistency and control performance compared to discriminative baselines by leveraging the geometric dependencies encoded within diffusion model latent manifolds.
Introduces a manifold distillation approach, Robot-DIFT, to transfer geometric priors from a frozen diffusion model into a deterministic feature network, enabling geometrically consistent visuomotor control.
The paper introduces ABot-N0, a Vision-Language-Action (VLA) foundation model designed for unified embodied navigation across five core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 employs a hierarchical "Brain-Action" architecture, combining an LLM-based cognitive brain for semantic reasoning with a Flow Matching-based action expert for trajectory generation. The model is trained on a large-scale dataset of 16.9M expert trajectories and 5.0M reasoning samples, achieving state-of-the-art performance on seven benchmarks and demonstrating robust long-horizon navigation in real-world environments.
Introduces a unified Vision-Language-Action foundation model, ABot-N0, that achieves state-of-the-art performance across a diverse set of embodied navigation tasks.
The paper introduces Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation in cluttered scenes, specifically for language-grounded grasping. It constructs a hierarchical instance tree of semantic cues, using noisy masks as informative cues and employing cross-view grouping and conditional substitution to refine segmentation. The method incorporates open-vocabulary semantic embeddings for language grounding and a consistency-aware update mechanism for adapting to scene changes with minimal post-interaction data, achieving state-of-the-art performance in sparse-view and heavily cluttered environments.
Introduces a hierarchical instance tree leveraging noisy masks as cues for robust 3D instance segmentation in cluttered environments, outperforming existing methods in sparse-view scenarios.
This paper connects Joint-Embedding Predictive Architectures (JEPAs) with Quasimetric Reinforcement Learning (QRL) by focusing on a specific class of JEPA energy functions: intrinsic (least-action) energies defined as infima of accumulated local effort. It demonstrates that under closure and additivity assumptions, intrinsic energies are quasimetrics, aligning JEPAs trained on these energies with the quasimetric value functions used in QRL for goal-reaching control. The work highlights the structural mismatch between symmetric energies and one-way reachability, advocating for asymmetric (quasimetric) energies in scenarios where directionality is important.
Establishes a formal connection between intrinsic energy functions in Joint-Embedding Predictive Architectures and quasimetrics used in Quasimetric Reinforcement Learning.
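The paper's central claim can be checked on a toy example: define the intrinsic energy as the least accumulated effort over a directed graph and verify the quasimetric axioms. The graph below is a hypothetical one-way dynamics (moving "down" is cheap, climbing back is not); shortest-path cost satisfies identity and the triangle inequality but is deliberately asymmetric.

```python
import heapq

# Intrinsic (least-action) energy: E(a, b) = infimum of accumulated local
# effort along paths a -> b. With nonnegative edge costs this is a
# quasimetric: E(x, x) = 0 and E(x, z) <= E(x, y) + E(y, z), while E(x, y)
# need not equal E(y, x) -- the one-way reachability the paper emphasizes.
edges = {                        # hypothetical one-way dynamics
    "top":    [("bottom", 1.0)],  # dropping down is cheap
    "bottom": [("mid", 2.0)],
    "mid":    [("top", 2.0)],     # climbing back is expensive
}

def intrinsic_energy(src, dst):
    """Dijkstra: least accumulated effort from src to dst."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

nodes = ["top", "bottom", "mid"]
for x in nodes:
    assert intrinsic_energy(x, x) == 0.0                      # identity
    for y in nodes:
        for z in nodes:
            assert intrinsic_energy(x, z) <= intrinsic_energy(x, y) + intrinsic_energy(y, z)
assert intrinsic_energy("top", "bottom") != intrinsic_energy("bottom", "top")  # asymmetry
```

A symmetric energy would be forced to assign the same cost in both directions, which is exactly the structural mismatch with one-way reachability that the paper argues against.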
The paper introduces Multi-Graph Search (MGS), a novel search-based motion planning algorithm for high-dimensional robotic systems that addresses the limitations of existing methods in terms of motion consistency and computational cost. MGS maintains and expands multiple implicit graphs, focusing exploration on promising regions and merging disconnected subgraphs as needed. The authors prove completeness and bounded suboptimality of MGS and demonstrate its effectiveness on manipulation and mobile manipulation tasks.
Introduces Multi-Graph Search (MGS), a complete and bounded-suboptimal motion planning algorithm that generalizes unidirectional and bidirectional search to a multi-graph setting for improved efficiency in high-dimensional spaces.
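The multi-graph idea generalizes bidirectional search: grow several subgraphs from promising seeds and merge them when their frontiers meet. The sketch below is a toy BFS connectivity check on a grid, not the paper's algorithm (MGS is best-first with completeness and bounded-suboptimality guarantees); the grid, seeds, and union-find merging are illustrative assumptions.

```python
from collections import deque

def multi_graph_reachable(grid, seeds, start, goal):
    """Grow BFS subgraphs from several seeds; union subgraphs whose
    frontiers touch; report whether start and goal end up connected.
    grid: 0 = free cell, 1 = obstacle."""
    rows, cols = len(grid), len(grid[0])
    owner = {}                        # cell -> index of the subgraph that claimed it
    parent = list(range(len(seeds)))  # union-find over subgraphs

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    frontier = deque()
    for i, s in enumerate(seeds):
        owner[s] = i
        frontier.append((s, i))
    while frontier:
        (r, c), i = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            if (nr, nc) in owner:
                parent[find(owner[(nr, nc)])] = find(i)  # merge disconnected subgraphs
            else:
                owner[(nr, nc)] = i
                frontier.append(((nr, nc), i))
    return find(owner[start]) == find(owner[goal])

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
start, goal = (0, 0), (2, 3)
# Seeds at start, goal, and one intermediate region, as in multi-graph search.
assert multi_graph_reachable(grid, [start, goal, (2, 0)], start, goal)
```

Seeding extra subgraphs in promising regions lets exploration concentrate where progress is likely, which is the intuition behind expanding multiple implicit graphs rather than a single tree.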
The paper introduces Any House Any Task (AHAT), a household task planner designed for long-horizon planning in large environments with ambiguous instructions. AHAT trains an LLM to map task instructions and textual scene graphs into PDDL subgoals, which are then solved using symbolic reasoning for optimal plan generation. To improve the decomposition of complex intentions, the authors propose TGPO, a reinforcement learning algorithm that integrates external correction of intermediate reasoning traces into Group Relative Policy Optimization (GRPO), yielding significant performance gains.
Introduces a novel household task planner, AHAT, that leverages LLMs and symbolic reasoning with a new reinforcement learning algorithm, TGPO, to achieve superior long-horizon planning performance in complex, ambiguous environments.
The paper introduces ReaDy-Go, a real-to-sim pipeline that generates photorealistic dynamic scenarios using 3D Gaussian Splatting (GS) to train visual navigation policies robust to the sim-to-real gap and moving obstacles. ReaDy-Go combines a static GS scene with dynamic human GS avatars driven by plausible motions derived from 2D trajectories, and uses a robot expert planner designed for dynamic GS representations to generate navigation datasets. Experiments demonstrate that policies trained with ReaDy-Go outperform baselines in both simulation and real-world environments, exhibiting improved navigation performance and generalization.
Introduces a real-to-sim dynamic 3D Gaussian Splatting simulation pipeline, ReaDy-Go, for training visual navigation policies robust to the sim-to-real gap and moving obstacles.
This paper introduces Supervised Token Reduction for Multi-modal LLMs (SToRM), a novel framework designed to reduce the computational cost of end-to-end autonomous driving systems that use MLLMs. SToRM employs a lightweight importance predictor with short-term sliding windows, supervised training using an auxiliary path for pseudo-supervision, and an anchor-context merging module to minimize information loss during token reduction. Experiments on the LangAuto benchmark demonstrate that SToRM achieves comparable performance to using all tokens while reducing computational cost by up to 30x, outperforming existing methods under the same token budget.
Introduces SToRM, a supervised token reduction framework that significantly reduces the computational cost of MLLM-based end-to-end autonomous driving without sacrificing performance.
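The anchor-context merging idea can be illustrated in a few lines: score tokens with an importance predictor, keep the top-k as anchors, and fold each dropped token into its most similar anchor by averaging so discarded context is not lost outright. This is a hypothetical simplification; SToRM's actual predictor uses short-term sliding windows and pseudo-supervision, and the random scores below merely stand in for it.

```python
import numpy as np

rng = np.random.default_rng(0)

def reduce_tokens(tokens, scores, k):
    """tokens: (N, D); scores: (N,) importance estimates.
    Returns (k, D): top-k anchor tokens, each averaged with the dropped
    tokens most similar to it (toy anchor-context merging)."""
    anchor_idx = np.argsort(scores)[-k:]                 # keep k most important
    dropped_idx = np.setdiff1d(np.arange(len(tokens)), anchor_idx)
    anchors = tokens[anchor_idx]
    groups = [[tokens[i]] for i in anchor_idx]
    for i in dropped_idx:
        sims = anchors @ tokens[i]                       # dot-product similarity
        groups[int(np.argmax(sims))].append(tokens[i])   # assign to nearest anchor
    return np.stack([np.mean(g, axis=0) for g in groups])

tokens = rng.normal(size=(32, 8))
scores = rng.uniform(size=32)            # stand-in for the importance predictor
reduced = reduce_tokens(tokens, scores, k=4)
assert reduced.shape == (4, 8)           # 8x fewer tokens enter the MLLM
```

Because the MLLM's attention cost scales with sequence length, shrinking 32 tokens to 4 is where the reported order-of-magnitude compute savings come from.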
The paper introduces Iskra, a system for automatically differentiating through geometry processing algorithms implemented using imperative code. Iskra leverages the adjoint method and scatter-gather mesh processing to efficiently compute gradients for algorithms using local-global and ADMM solvers. The system enables inverse geometry processing applications by providing a low-effort, fast, and memory-efficient alternative to generic differentiable optimization.
Introduces Iskra, a system that automatically generates efficient backward passes for existing geometry processing algorithms by applying the adjoint method to imperative code.
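The adjoint method at the heart of this approach can be shown on a toy implicit solve: to differentiate a loss through the solution of A(theta) x = b, solve one extra linear system for the adjoint variable instead of differentiating the solver's iterations. The quadratic loss and parameterization below are illustrative assumptions, unrelated to any specific geometry-processing algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a local-global / ADMM solve: x(theta) solves A(theta) x = b.
A0 = rng.normal(size=(5, 5))
A0 = A0 @ A0.T + 5.0 * np.eye(5)     # well-conditioned SPD base matrix
b = rng.normal(size=5)

def loss(theta):
    x = np.linalg.solve(A0 + theta * np.eye(5), b)
    return 0.5 * float(x @ x)        # L(x) = 0.5 ||x||^2

# Adjoint gradient: solve A^T lam = dL/dx once, then
# dL/dtheta = -lam^T (dA/dtheta) x, with dA/dtheta = I here.
theta = 0.3
A = A0 + theta * np.eye(5)
x = np.linalg.solve(A, b)
lam = np.linalg.solve(A.T, x)        # adjoint solve, dL/dx = x
grad_adjoint = -float(lam @ x)

# Check against central finite differences.
eps = 1e-6
grad_fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
assert abs(grad_adjoint - grad_fd) < 1e-5
```

The payoff is memory: the adjoint needs only the converged solution, not the tape of every solver iteration, which is why this route is faster and leaner than generic differentiable optimization.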
This paper introduces a contact-aware bin-packing approach for partially filled containers, addressing a limitation of prior work focused on empty containers and collision-free strategies. The authors use a contact-based multi-object trajectory optimizer within a model predictive controller to purposefully interact with existing objects and create space for new items. The system integrates a physics-aware perception system for pose estimation during occlusions and a method for suggesting physically feasible placement locations.
Introduces a contact-aware bin-packing system that optimizes object trajectories to manipulate existing items within a partially filled container, enabling the placement of new items.
The paper investigates the inconsistent generalization performance of vision-proprioception policies in robotic manipulation, finding that vision plays a limited role during motion-transition sub-phases because the policy learns to prefer proprioceptive signals during training. To address this, the authors propose Gradient Adjustment with Phase-guidance (GAP), which adaptively modulates the optimization of proprioception based on estimated probabilities of motion-transition phases. Experiments demonstrate that GAP improves the robustness and generalization of vision-proprioception policies in both simulated and real-world environments.
Introduces Gradient Adjustment with Phase-guidance (GAP) to dynamically modulate proprioception's gradient during policy learning, enabling better collaboration between vision and proprioception modalities in robotic manipulation.
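The mechanism can be sketched as a per-phase rescaling of the proprioception branch's gradient. The `(1 - p)` scaling below is a hypothetical instantiation of the idea, suppressing proprioceptive updates when the estimated motion-transition probability is high so the policy must lean on vision there; GAP's actual modulation schedule may differ.

```python
import numpy as np

def gap_gradients(grad_vision, grad_proprio, p_transition):
    """Scale the proprioception gradient by (1 - p_transition); leave the
    vision gradient untouched (toy phase-guided gradient adjustment)."""
    return grad_vision, (1.0 - p_transition) * grad_proprio

g_v = np.array([0.2, -0.1])   # hypothetical vision-branch gradient
g_p = np.array([1.0, 0.5])    # hypothetical proprioception-branch gradient

_, g_p_steady = gap_gradients(g_v, g_p, p_transition=0.0)   # steady sub-phase
g_v_trans, g_p_trans = gap_gradients(g_v, g_p, p_transition=0.9)  # transition

assert np.allclose(g_p_steady, g_p)                      # unchanged off-phase
assert np.linalg.norm(g_p_trans) < np.linalg.norm(g_p)   # suppressed in-phase
assert np.allclose(g_v_trans, g_v)                       # vision untouched
```

Without such modulation, the easier-to-fit proprioceptive signal dominates training and vision is never forced to carry the transition sub-phases, which is the shortcut behavior the paper diagnoses.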
This paper introduces GSO-SLAM, a real-time monocular SLAM system that bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS) for simultaneous tracking and mapping. It addresses the limitations of unified or loosely integrated approaches by formulating a joint optimization within an Expectation-Maximization (EM) framework, refining VO-derived depth estimates and the GS representation concurrently. The method also introduces Gaussian Splat Initialization, leveraging image information and VO data to efficiently initialize the Gaussian scene, achieving state-of-the-art reconstruction fidelity and tracking accuracy in real-time.
Introduces a bidirectionally coupled Visual Odometry and Gaussian Splatting SLAM system optimized via an EM framework for real-time performance and improved accuracy.

