The paper introduces VLM-E2E, a novel end-to-end autonomous driving framework that leverages Vision-Language Models (VLMs) to provide attentional cues for training, addressing the loss of semantic information when converting 2D observations to 3D BEV representations. VLM-E2E integrates textual representations from VLMs into BEV features for semantic supervision, enabling the model to learn richer, attention-aware feature representations aligned with human-like driving behavior. The method also introduces a BEV-Text learnable weighted fusion strategy to dynamically balance the contributions of BEV and text features, leading to significant performance improvements on the nuScenes dataset across perception, prediction, and planning tasks.
Autonomous driving gets a boost from VLMs: fusing textual scene understanding with BEV features yields significant gains in perception, prediction, and planning.
Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space. This loss hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the imbalance in modality importance when fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from the visual and textual modalities is effectively utilized. By explicitly addressing this imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showcasing the effectiveness of our attention-enhanced BEV representation in enabling more accurate and reliable autonomous driving tasks.
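The learnable weighted fusion idea can be sketched as follows. This is a minimal illustrative assumption, not the paper's actual implementation: the function name `fuse_bev_text` and the use of a single scalar gate are hypothetical, and the real method may learn per-channel or per-location weights trained jointly with the network.

```python
import numpy as np

def sigmoid(x):
    # Squash a real-valued logit into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def fuse_bev_text(bev_feat, text_feat, alpha_logit):
    """Weighted fusion of BEV and text features (illustrative sketch).

    alpha_logit is a learnable parameter (optimized elsewhere during
    training); the sigmoid keeps the fusion weight in (0, 1) so the two
    modalities' contributions always sum to 1.
    """
    w = sigmoid(alpha_logit)
    return w * bev_feat + (1.0 - w) * text_feat

# Toy usage: equal-sized BEV and text feature maps, neutral gate (logit 0
# gives weight 0.5, i.e., a plain average of the two modalities).
bev = np.ones((4, 8))      # stand-in for a BEV feature map
text = np.zeros((4, 8))    # stand-in for a projected text embedding
fused = fuse_bev_text(bev, text, alpha_logit=0.0)
```

With a neutral gate the fusion reduces to a simple average; as training shifts `alpha_logit`, the model can emphasize whichever modality is more informative for the scene.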