Feb 26, 2026 · arXiv:2602.23358

A Dataset is Worth 1 MB

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

AI Summary

The paper introduces Pseudo-Labels as Data (PLADA), a novel dataset distillation method that transmits task knowledge by sending only class labels for relevant images from a preloaded, generic, unlabeled reference dataset, thereby eliminating pixel transmission. To mitigate distribution mismatch, PLADA employs a pruning mechanism that filters the reference dataset, retaining labels of semantically relevant images. Experiments across 10 datasets demonstrate that PLADA achieves high classification accuracy with a payload size of less than 1 MB, offering a practical solution for efficient dataset serving.

Key Contribution

Transmitting less than 1 MB of pseudo-labels can transfer task knowledge while retaining high classification accuracy, eliminating the need to send raw image data.

Abstract

A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets demonstrate that our approach can transfer task knowledge with a payload of less than 1 MB while retaining high classification accuracy, offering a promising solution for efficient dataset serving.
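The label-only payload described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoding (a bitmask over the reference dataset plus fixed-width packed labels), the function names `pack_payload` and `unpack_payload`, the random stand-ins for relevance scores and pseudo-labels, and the choice of 100 classes and 200,000 retained images are all assumptions made for illustration. The reference size of 1,281,167 matches the ImageNet-1K training set.

```python
import numpy as np

def bits_needed(num_classes: int) -> int:
    # Fixed-width code: enough bits to represent labels 0..num_classes-1.
    return max(1, (num_classes - 1).bit_length())

def pack_payload(keep_mask: np.ndarray, labels: np.ndarray, num_classes: int) -> bytes:
    """Server side (hypothetical format): selection bitmask followed by packed labels."""
    b = bits_needed(num_classes)
    mask_bytes = np.packbits(keep_mask.astype(np.uint8)).tobytes()
    kept = labels[keep_mask].astype(np.uint32)
    # Expand each kept label into b bits, most significant bit first.
    shifts = np.arange(b - 1, -1, -1, dtype=np.uint32)
    label_bits = ((kept[:, None] >> shifts) & 1).astype(np.uint8)
    return mask_bytes + np.packbits(label_bits.ravel()).tobytes()

def unpack_payload(payload: bytes, num_reference: int, num_classes: int):
    """Client side: recover (reference indices, labels) for the retained images."""
    b = bits_needed(num_classes)
    mask_len = (num_reference + 7) // 8
    mask = np.unpackbits(np.frombuffer(payload[:mask_len], dtype=np.uint8))
    mask = mask[:num_reference].astype(bool)
    n_kept = int(mask.sum())
    bits = np.unpackbits(np.frombuffer(payload[mask_len:], dtype=np.uint8))
    bits = bits[: n_kept * b].reshape(n_kept, b)
    weights = 1 << np.arange(b - 1, -1, -1)
    return np.flatnonzero(mask), bits @ weights

# Illustrative numbers only; random arrays stand in for real scores and labels.
rng = np.random.default_rng(0)
num_reference, num_classes, keep = 1_281_167, 100, 200_000
pseudo_labels = rng.integers(0, num_classes, size=num_reference)  # assumed labeling-model output
relevance = rng.random(num_reference)                             # assumed pruning score

# Pruning stand-in: keep the `keep` highest-scoring reference images.
keep_mask = np.zeros(num_reference, dtype=bool)
keep_mask[np.argpartition(relevance, -keep)[-keep:]] = True

payload = pack_payload(keep_mask, pseudo_labels, num_classes)
idx, lab = unpack_payload(payload, num_reference, num_classes)
print(f"payload size: {len(payload):,} bytes")
```

Under these assumptions the bitmask costs 160,146 bytes and the 200,000 seven-bit labels cost 175,000 bytes, for 335,146 bytes in total. The client then trains on the preloaded images at `idx` with labels `lab`. An entropy coder or a different selection encoding could shrink this further; the paper's actual format may differ.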

Data Curation & Synthetic Data · Inference & Quantization · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 62

Year: 2026

Venue: N/A
