Tsinghua AI | Feb 19, 2026 | arXiv:2602.17465

Entropy-Based Data Selection for Language Models

AI Summary

The paper introduces Entropy-Based Unsupervised Data Selection (EUDS), a computationally efficient framework that selects fine-tuning data for language models using uncertainty estimates of the data. EUDS uses entropy as a proxy for data usability, enabling effective data filtering without a high compute budget. Experiments on sentiment analysis, topic classification, and question answering tasks show that EUDS reduces computational cost and improves training-time efficiency while using less data.

Key Contribution

Fine-tune your LLM faster on less data: this entropy-based data selection method slashes compute costs without sacrificing performance.

Abstract

Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely tied to computational resources, because the techniques themselves always require a high compute budget. Owing to the resource limitations of practical fine-tuning scenarios, we systematically reveal the relationship between data selection and the uncertainty estimation of the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training-time efficiency while requiring less data. This provides an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.
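The abstract does not say how EUDS computes entropy or which examples it keeps, so the following is only a minimal sketch of the general idea of entropy-based filtering, not the authors' method. It assumes each candidate example already comes with a discrete probability distribution (for instance, class probabilities from some scoring model), scores it with Shannon entropy, and keeps a fixed fraction of the pool. The function names, the `keep_fraction` parameter, and the choice to rank highest-entropy examples first are all hypothetical.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Zero-probability outcomes contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_entropy(
    distributions: Sequence[Sequence[float]],
    keep_fraction: float,
    highest_first: bool = True,
) -> list[int]:
    """Return the indices of the examples kept after ranking by entropy.

    Assumption: whether high- or low-entropy examples are more useful is
    not stated in the abstract, so the direction is left as a parameter.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    scores = [shannon_entropy(d) for d in distributions]
    ranked = sorted(
        range(len(scores)), key=lambda i: scores[i], reverse=highest_first
    )
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    # Return indices in their original order for easy dataset slicing.
    return sorted(ranked[:n_keep])


# Hypothetical pool of four examples with two-class probabilities.
pool = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0], [0.6, 0.4]]
kept = select_by_entropy(pool, keep_fraction=0.5)
print(kept)
```

With this toy pool, the two most uncertain examples (indices 0 and 3) are kept. Passing `highest_first=False` would instead keep the most confident ones, which is the opposite filtering policy; which of the two matches EUDS would have to be checked against the paper itself.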

Data Curation & Synthetic Data | Natural Language Processing | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
