The paper introduces the Interactive Enhanced Driving Dataset (IEDD) to address two problems in existing autonomous driving datasets used for Vision-Language-Action (VLA) model development: interactive scenarios are scarce, and the modalities are poorly aligned. The authors develop a scalable pipeline that mines million-level interactive segments from naturalistic driving data based on interactive trajectories, and they design metrics to quantify the interaction processes. The IEDD-VQA dataset, which consists of synthetic Bird's Eye View (BEV) videos whose semantic actions are aligned with structured language, is presented as a benchmark for evaluating and fine-tuning autonomous driving models, with results reported for ten mainstream Vision-Language Models (VLMs).
Unlock richer autonomous driving models with IEDD, a new dataset of million-level interactive driving segments with strictly aligned language and action.
The evolution of autonomous driving towards full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction processes. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird's Eye View (BEV) videos in which semantic actions are strictly aligned with structured language. Benchmark results for ten mainstream Vision-Language Models (VLMs) are provided to demonstrate the dataset's value for assessing and fine-tuning the reasoning capabilities of autonomous driving models.
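The abstract does not say which metrics the mining pipeline uses to decide that two trajectories interact. Purely as a hedged illustration, the sketch below screens a pair of agent trajectories with two common surrogate-safety measures: the minimum simultaneous gap and a sampled approximation of post-encroachment time (PET). The `(t, x, y)` sample format, the function names `min_gap`, `post_encroachment_time`, and `is_interactive`, and the `radius` and `pet_threshold` values are all hypothetical assumptions, not IEDD's actual method.

```python
import math
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (timestamp in s, x in m, y in m).
Sample = Tuple[float, float, float]


def min_gap(a: Sequence[Sample], b: Sequence[Sample]) -> float:
    """Smallest Euclidean distance between the two agents at shared timestamps."""
    pos_b = {t: (x, y) for t, x, y in b}
    gaps = [math.hypot(x - pos_b[t][0], y - pos_b[t][1])
            for t, x, y in a if t in pos_b]
    return min(gaps) if gaps else math.inf


def post_encroachment_time(a: Sequence[Sample], b: Sequence[Sample],
                           radius: float) -> Optional[float]:
    """Approximate PET: the smallest time difference between the two agents
    occupying (nearly) the same place, i.e. sample pairs within `radius`
    metres of each other. Returns None if their paths never come that close."""
    best: Optional[float] = None
    for ta, xa, ya in a:
        for tb, xb, yb in b:
            if math.hypot(xa - xb, ya - yb) <= radius:
                gap = abs(ta - tb)
                if best is None or gap < best:
                    best = gap
    return best


def is_interactive(a: Sequence[Sample], b: Sequence[Sample],
                   radius: float = 2.0, pet_threshold: float = 3.0) -> bool:
    """Hypothetical screening rule: flag the pair as interactive when their
    paths cross and the approximate PET falls below a threshold."""
    pet = post_encroachment_time(a, b, radius)
    return pet is not None and pet <= pet_threshold


if __name__ == "__main__":
    # Agent A drives along the x-axis; agent B crosses its path at (5, 0)
    # one second after A has passed that point.
    A = [(float(t), float(t), 0.0) for t in range(11)]
    B = [(float(t), 5.0, float(t - 6)) for t in range(11)]
    print(min_gap(A, B), post_encroachment_time(A, B, radius=0.5))
    print(is_interactive(A, B, radius=0.5))
```

A real pipeline would likely interpolate trajectories and compute PET over an explicit conflict zone rather than over raw sample pairs; the quadratic pairwise loop here is kept only for readability.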