Corresponding author | PKU | Apr 9, 2026 | arXiv:2604.07769

An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

Chen Xing, Chengli Xing, Zhengran Zeng, Z. Zeng, Gexiang Fang, Rui Xie, Wei Ye, Shikun Zhang

AI Summary

This paper investigates the effectiveness of data-influence-score filtering for pretraining Code-LLMs, adapting the technique to generative programming tasks by using the loss on downstream coding tasks as a performance metric. The authors pre-train a 1B-parameter Code-LLM on 100B code tokens and evaluate the impact of filtering based on data influence scores. Results show that data-influence-score filtering improves programming performance, but the characteristics of beneficial training data vary across downstream tasks.

Key Contribution

Turns out, what makes for good code pre-training data depends heavily on the downstream task you're targeting.

Abstract

Recent advancements in code large language models (Code-LLMs) have demonstrated remarkable capabilities in resolving programming-related tasks. Meanwhile, researchers have recognized that the quality of pre-training data is crucial for improving LLM performance. However, most existing research on pre-training data filtering has focused on general datasets, and little attention has been paid to programming datasets. In this paper, we aim to address this gap by exploring the effectiveness of a widely used general data filtering technique, i.e., data-influence-score filtering, within the context of programming-related datasets. To this end, we first introduce a method for calculating the data-influence-score for generative programming tasks, which involves transforming a variety of downstream coding tasks into validation sets and using the model's loss on these sets as a performance metric. Next, we pre-train a Code-LLM with 1 billion parameters from scratch on a dataset of 100 billion code tokens. Based on this model, we conduct an extensive empirical study to evaluate the effectiveness of data-influence-score filtering methods. Specifically, we examine how well this technique improves model performance, investigate how the characteristics of beneficial training data vary across different training stages and programming tasks, and assess the feasibility of a prediction-based data-influence-score filtering method. Our findings show that data-influence-score filtering based on validation-set loss can enhance the model's programming performance. Moreover, we observe that the criteria for beneficial training data differ significantly across downstream programming tasks.
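The abstract does not spell out the scoring formula, so the following is only a minimal sketch of one common reading of validation-loss-based influence scoring: take one gradient step on a single training example and measure how much the loss on a validation set drops. A toy one-weight linear model stands in for the Code-LLM, and every name here (`mse_loss`, `sgd_step`, `influence_score`, `select_top_fraction`, the learning rate, the keep fraction) is a hypothetical choice for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Toy stand-in for a training or validation sample: (features, target).
Example = Tuple[List[float], float]


def mse_loss(weights: Sequence[float], data: Sequence[Example]) -> float:
    """Mean squared error of a linear model; stands in for validation-set loss."""
    total = 0.0
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(data)


def sgd_step(weights: Sequence[float], example: Example, lr: float) -> List[float]:
    """Return the weights after one gradient step on a single training example."""
    x, y = example
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [w - lr * 2.0 * err * xi for w, xi in zip(weights, x)]


def influence_score(weights: Sequence[float], example: Example,
                    validation: Sequence[Example], lr: float = 0.1) -> float:
    """Validation loss before the step minus after; positive means the example helped."""
    before = mse_loss(weights, validation)
    after = mse_loss(sgd_step(weights, example, lr), validation)
    return before - after


def select_top_fraction(weights: Sequence[float], pool: Sequence[Example],
                        validation: Sequence[Example], keep: float = 0.5,
                        lr: float = 0.1) -> List[Example]:
    """Keep the highest-scoring fraction of the candidate pool (assumed filtering rule)."""
    ranked = sorted(pool,
                    key=lambda ex: influence_score(weights, ex, validation, lr),
                    reverse=True)
    return ranked[:max(1, int(len(pool) * keep))]


# Tiny illustration: the validation set wants weight 2.0; one candidate agrees, one conflicts.
w0 = [0.0]
val = [([1.0], 2.0)]
good = ([1.0], 2.0)
bad = ([1.0], -2.0)
kept = select_top_fraction(w0, [bad, good], val, keep=0.5)
```

In this reading, swapping `val` for a validation set built from a different downstream task can change which examples score highest, which is consistent with the abstract's observation that the criteria for beneficial data differ across tasks.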

Code Generation & Program Synthesis | Data Curation & Synthetic Data | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
