The paper introduces Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a data selection framework that partitions the dataset into domains, fits neural scaling laws from each domain to the evaluation metrics, and builds a data mixture by iteratively adding data from the domains that maximize the change in those metrics. Applied to end-to-end autonomous driving, MOSAIC outperforms baselines on the Extended Predictive Driver Model Score (EPDMS) with up to 80% less data.
Training data for autonomous driving can be selected far more efficiently: MOSAIC outperforms baseline selection methods on EPDMS with up to 80% less data by choosing training data according to per-domain scaling laws.
Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models, and correspondingly their training data, must address the different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80% less data.
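The three steps in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single lower-is-better error (for example, 1 minus a score such as EPDMS), a power-law form error ~ b * n^(-c) per domain, and domains that contribute independently. The function names, domain names, and pilot numbers are all hypothetical.

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Fit error ~ b * n**(-c) by least squares in log-log space (assumed form)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)  # (b, c)

def predicted_error(law, n):
    """Error predicted by a fitted (b, c) law at n samples."""
    b, c = law
    return b * n ** (-c)

def greedy_mixture(laws, pool_sizes, budget, step, start):
    """Spend `budget` extra samples in chunks of `step`, each time on the
    domain whose fitted law predicts the largest error reduction."""
    counts = {d: start for d in laws}  # every domain begins with pilot data
    remaining = budget
    while remaining >= step:
        best, best_gain = None, 0.0
        for d, law in laws.items():
            if counts[d] + step > pool_sizes[d]:
                continue  # this domain has no more data to draw from
            gain = predicted_error(law, counts[d]) - predicted_error(law, counts[d] + step)
            if gain > best_gain:
                best, best_gain = d, gain
        if best is None:
            break  # no domain can still improve the prediction
        counts[best] += step
        remaining -= step
    return counts

# Hypothetical pilot measurements: error observed at a few training-set sizes.
sizes = np.array([100.0, 200.0, 400.0, 800.0])
pilot = {
    "highway": 0.5 * sizes ** -0.1,  # synthetic: error decays slowly
    "urban": 2.0 * sizes ** -0.5,    # synthetic: error decays quickly
}
laws = {d: fit_power_law(sizes, e) for d, e in pilot.items()}
mixture = greedy_mixture(
    laws,
    pool_sizes={"highway": 5000, "urban": 5000},
    budget=2000,
    step=100,
    start=100,
)
print(mixture)
```

The greedy loop keeps the sketch short, but the additive, single-metric assumption is a strong simplification: the abstract stresses that data points affect several metrics in ambiguous ways, which a real system would need to model.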