Stanford HAI | Sydney | Mar 12, 2026 | arXiv:2603.11634

Diversity You Can Actually Measure: A Fast, Model-Free Diversity Metric for Robotics Datasets

Sreevardhan Sirigiri, N. D. Lara, Christopher Agia, F. Shkurti, Fabio Ramos

AI Summary

This paper introduces a model-free diversity metric for robotics datasets: a signature transform-based entropy defined on the Gram matrix of a signature kernel over demonstrations. The metric quantifies diversity while respecting trajectory structure and geometry, which makes it possible to analyze how dataset diversity relates to imitation learning performance. The authors also propose FAKTUAL, a data curation algorithm that selects a diverse subset of demonstrations by maximizing entropy under a subset-size budget. On RoboMimic, MetaWorld, and real-world manipulation tasks, FAKTUAL consistently improves downstream success rates over random selection, and it is substantially more computationally efficient than recent robot data curation methods.

Key Contribution

Forget slow, model-dependent curation: FAKTUAL offers a fast, model-free way to boost robot imitation learning by directly maximizing the entropy of demonstration datasets.

Abstract

Robotics datasets for imitation learning typically consist of long-horizon trajectories of different lengths over states, actions, and high-dimensional observations (e.g., RGB video), making it non-trivial to quantify diversity in a way that respects the underlying trajectory structure and geometry. We extend Shannon and von Neumann entropy to this setting by defining signature transform-based entropy on the Gram matrix of a signature kernel over demonstrations, yielding entropy and diversity metrics that operate directly on the demonstration dataset. Building on these metrics, we study how dataset diversity affects generalization performance in robot imitation learning and propose a simple, model-free way to curate diverse demonstrations. We introduce FAKTUAL (FAst trajectory Kernel enTropy cUration for imitation Learning), a data curation algorithm that selects a subset of demonstrations maximizing entropy given a subset-size budget. FAKTUAL is fully model-free, requires no access to the imitation policy or rollouts, and adds negligible overhead relative to policy training. We evaluate our approach on image- and state-based RoboMimic and MetaWorld benchmarks, as well as four real-world manipulation tasks. Across tasks and architectures, diversity-aware curation with FAKTUAL consistently improves downstream success rates over random selection, while being substantially more computationally efficient than recent robot data curation methods. Our results suggest that the entropy of demonstration datasets is a practical tool for understanding and improving dataset diversity in robot imitation learning.
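To make the idea of entropy on a Gram matrix concrete, here is a minimal sketch in Python. It is not the authors' implementation, and several pieces are assumptions for illustration only: the signature kernel is replaced by a simple RBF kernel on linearly resampled trajectories (a stand-in that ignores the path structure the paper cares about), the Gram matrix is normalized to unit trace before taking the von Neumann entropy, and the subset is chosen greedily because the abstract does not say how the maximization is carried out. The helper names `resample`, `gram_matrix`, `von_neumann_entropy`, and `greedy_select` are hypothetical.

```python
import numpy as np


def resample(traj, n_points=16):
    """Linearly resample a (T, d) trajectory to n_points time steps."""
    traj = np.asarray(traj, dtype=float)
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n_points)
    cols = [np.interp(t_new, t_old, traj[:, j]) for j in range(traj.shape[1])]
    return np.stack(cols, axis=1)


def gram_matrix(trajs, bandwidth=1.0, n_points=16):
    """Stand-in kernel (assumption): RBF on resampled, flattened trajectories.

    The paper uses a signature kernel here; this RBF is only a placeholder
    that also yields a positive semidefinite Gram matrix.
    """
    X = np.stack([resample(t, n_points).ravel() for t in trajs])
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2 * X.shape[1]))


def von_neumann_entropy(K):
    """Entropy of the eigenvalues of the trace-normalized Gram matrix."""
    rho = K / np.trace(K)
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]  # drop numerically zero eigenvalues
    return float(-np.sum(eig * np.log(eig)))


def greedy_select(K, budget):
    """Greedy heuristic (assumption): add the demo that most raises entropy."""
    selected, remaining = [], list(range(K.shape[0]))
    while len(selected) < budget and remaining:
        def score(i):
            idx = selected + [i]
            return von_neumann_entropy(K[np.ix_(idx, idx)])
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two identical demonstrations and one different one, with unequal lengths.
demos = [np.zeros((5, 2)), np.zeros((9, 2)), 3.0 * np.ones((7, 2))]
K = gram_matrix(demos)
subset = greedy_select(K, budget=2)
print(subset, von_neumann_entropy(K[np.ix_(subset, subset)]))
```

In this toy setup the two all-zero demonstrations are indistinguishable under the kernel, so a subset containing both has near-zero entropy, while pairing one of them with the third demonstration gives an entropy close to log 2. A faithful reproduction would swap `gram_matrix` for a signature kernel and follow the paper's own optimization procedure.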

Data Curation & Synthetic Data | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 81

Year: 2026

Venue: N/A
