This paper analyzes the limitations of existing vision foundation models for robotic perception in natural environments, highlighting their performance gaps on field robotics tasks. Using the recently released WildCross benchmark, which includes over 476K sequential RGB frames with semi-dense depth and surface normal annotations, the authors run expanded experiments focused on metric depth estimation. The results show that current models, trained largely on structured urban scenes, perform poorly in natural settings, underscoring the need for training approaches that account for the complexity of these environments.
Current vision models falter in natural environments, with the WildCross benchmark revealing critical gaps in depth estimation capabilities.
Natural environments present a complex challenge to robotic perception systems. Current models, particularly vision foundation models, are largely trained on structured, urban environments, which leads to weaknesses in their perception for field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF pose and synchronized dense lidar submaps. In this work, we provide an expanded analysis of the WildCross benchmark results, with particular emphasis on additional metric depth estimation experiments. The code repository and dataset for this work are available at https://csiro-robotics.github.io/WildCross.
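Because the depth annotations are semi-dense, a metric depth evaluation has to score predictions only at pixels that carry ground truth. The sketch below illustrates that idea with three metrics commonly reported for metric depth (absolute relative error, RMSE, and the δ < 1.25 accuracy). It is not the WildCross evaluation code: the function name `depth_metrics`, the convention that missing ground truth is stored as 0 or NaN, and the `min_depth` / `max_depth` range are all assumptions made for illustration.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Score a metric depth prediction against semi-dense ground truth.

    Hypothetical helper, not part of the WildCross codebase.
    pred, gt: arrays of the same shape, depth in metres.
    Pixels without ground truth are assumed to be 0 or NaN in `gt`.
    min_depth, max_depth: assumed evaluation range, chosen arbitrarily here.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")

    # Keep only pixels with usable ground truth and a positive, finite prediction.
    valid = (
        np.isfinite(gt) & (gt > min_depth) & (gt < max_depth)
        & np.isfinite(pred) & (pred > 0)
    )
    if not valid.any():
        raise ValueError("no valid ground-truth pixels to evaluate")

    p, g = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(p - g) / g)      # mean absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))     # root mean squared error, metres
    ratio = np.maximum(p / g, g / p)          # symmetric per-pixel ratio
    delta1 = np.mean(ratio < 1.25)            # fraction of pixels within 25%

    return {
        "abs_rel": float(abs_rel),
        "rmse": float(rmse),
        "delta1": float(delta1),
        "n_valid": int(valid.sum()),
    }


if __name__ == "__main__":
    # Toy 2x2 frame: one pixel (value 0.0) has no ground truth and is ignored.
    gt = np.array([[2.0, 0.0], [4.0, 10.0]])
    pred = np.array([[2.0, 5.0], [4.4, 10.0]])
    print(depth_metrics(pred, gt))
```

No scale or shift alignment is applied before scoring, since the task is metric rather than relative depth; an evaluation of relative-depth models would need an alignment step first.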