The paper introduces ABot-N0, a unified Vision-Language-Action (VLA) foundation model for embodied navigation, trained on a large-scale dataset of 16.9M expert trajectories and 5.0M reasoning samples. ABot-N0 employs a hierarchical "Brain-Action" architecture, combining an LLM-based cognitive brain for reasoning with a Flow Matching-based action expert for trajectory generation. The model achieves state-of-the-art performance across 7 benchmarks in Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following tasks, demonstrating its versatility and outperforming task-specific models.
Forget task-specific architectures: a single Vision-Language-Action foundation model, ABot-N0, now dominates embodied navigation across five distinct tasks.
Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a ``Grand Unification'' across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical ``Brain-Action'' architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 $\text{km}^2$). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
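The abstract's Flow Matching-based Action Expert can be understood as integrating a learned velocity field that transports Gaussian noise into an action trajectory. The sketch below is purely illustrative, not the paper's implementation: the real Action Expert is a trained network conditioned on the Cognitive Brain's latent state, whereas here a closed-form velocity field, along with all names, shapes, and constants (`ACTION_DIM`, `HORIZON`, the step count), are assumptions chosen to make the flow-integration idea concrete.

```python
import numpy as np

ACTION_DIM = 2  # hypothetical: (linear velocity, angular velocity) per waypoint
HORIZON = 8     # hypothetical: number of future waypoints per trajectory chunk

def velocity_field(x, t, context):
    """Stand-in for a learned velocity network v_theta(x, t | context).

    This closed-form field transports samples toward the conditioning
    `context` (a target trajectory chunk) as t goes from 0 to 1.
    """
    return (context - x) / (1.0 - t + 1e-3)

def sample_trajectory(context, steps=32, seed=0):
    """Integrate the flow ODE dx/dt = v(x, t) from Gaussian noise at t=0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((HORIZON, ACTION_DIM))  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, context)  # Euler step
    return x

# Hypothetical conditioning: a constant "move forward" trajectory chunk.
goal = np.tile([0.5, 0.0], (HORIZON, 1))
traj = sample_trajectory(goal)
```

At inference time, continuous control comes from re-running this integration each planning cycle with fresh conditioning from the cognitive brain; the ODE view is what distinguishes flow matching from iterative denoising in diffusion policies.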