Johannes Gutenberg University Mainz, Universidad Iberoamericana
Apr 23, 2026
arXiv:2604.21716

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

M. Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager, K. Wense

AI Summary

This paper investigates bias in code generation by evaluating LLMs on the task of generating ML pipelines, and shows that simple conditional statements, the usual proxy, significantly underestimate bias. The authors find that sensitive attributes are included in 87.7% of generated pipelines on average, even though the models demonstrably exclude irrelevant features. This is substantially higher than the 59.2% observed with conditional statements, which highlights the limitations of current bias evaluation benchmarks.

Key Contribution

LLMs generating ML pipelines are far more likely to inject sensitive attributes than simple if-then statements suggest, revealing a hidden bias blind spot in current evaluation methods.

Abstract

Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
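As a rough illustration of what a figure such as "sensitive attributes appear in X% of cases" could measure, the sketch below computes the share of generated code snippets whose selected feature list contains a sensitive attribute. It is not the authors' evaluation code. The attribute set `SENSITIVE`, the helper names, and the convention that a pipeline stores its chosen columns in a variable called `features` are all assumptions made for this example.

```python
import ast

# Hypothetical set of sensitive attributes; the paper's actual list is not given here.
SENSITIVE = {"race", "gender", "age"}


def selected_features(code, var_name="features"):
    """Collect string literals from list/tuple assignments to `var_name`.

    Assumes (for illustration only) that a generated pipeline records its
    chosen columns as `features = [...]`.
    """
    selected = set()
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Assign):
            continue
        if not isinstance(node.value, (ast.List, ast.Tuple)):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if var_name not in targets:
            continue
        for elt in node.value.elts:
            if isinstance(elt, ast.Constant) and isinstance(elt.value, str):
                selected.add(elt.value)
    return selected


def sensitive_inclusion_rate(snippets, sensitive=SENSITIVE):
    """Fraction of snippets whose selected features include a sensitive attribute."""
    if not snippets:
        return 0.0
    hits = sum(1 for code in snippets if selected_features(code) & set(sensitive))
    return hits / len(snippets)


# Two toy "generated pipelines" for a credit-scoring task.
example_snippets = [
    'features = ["income", "race"]\nX = df[features]\n',
    'features = ["income", "debt_ratio"]\nX = df[features]\n',
]
example_rate = sensitive_inclusion_rate(example_snippets)
```

Parsing the feature list rather than searching for the attribute name anywhere in the code avoids counting a pipeline that mentions "race" only to drop it. Real generated pipelines select features in many other ways, so a practical evaluation would need a more robust extraction step.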

Code Generation & Program Synthesis; Constitutional AI & AI Ethics; Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
