Apr 6, 2026, arXiv:2604.04872

Synthetic Sandbox for Training Machine Learning Engineering Agents

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hongfei Yan

AI Summary

This paper introduces SandMLE, a multi-agent framework that generates synthetic machine learning engineering (MLE) environments from a small set of seed tasks to enable efficient on-policy reinforcement learning. By constraining datasets to a micro-scale (50-200 samples per task), SandMLE reduces execution time by over 13x, making trajectory-wise RL feasible for MLE agents. Experiments demonstrate that SandMLE significantly outperforms supervised fine-tuning baselines on MLE-bench-lite and exhibits strong generalization to unseen agentic scaffolds on MLE-Dojo.

Key Contribution

On-policy RL for machine learning engineering agents becomes practical with a synthetic sandbox that cuts execution time by over 13x and raises the relative medal rate on MLE-bench-lite by 20.3% to 66.9% over SFT baselines.

Abstract

As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines -- data preprocessing, model training, and metric evaluation -- on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.
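The two headline figures in the abstract, a speedup factor and a relative improvement, can be read with the following minimal sketch. It assumes the usual definitions (speedup as a ratio of wall-clock times, relative improvement as the gain divided by the baseline); the abstract does not spell these out. The function names and the input numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast run is than the slow run."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    # Hypothetical medal rates (fractions of tasks earning a medal), illustration only.
    sft_medal_rate = 0.20
    rl_medal_rate = 0.25
    print(f"relative improvement: {relative_improvement(sft_medal_rate, rl_medal_rate):.1f}%")

    # Hypothetical per-rollout execution times in seconds, illustration only.
    print(f"speedup: {speedup(650.0, 50.0):.1f}x")
```

Under this reading, a relative improvement of 66.9% means the trained policy's medal rate is about 1.669 times the SFT baseline's, not that it rose by 66.9 percentage points.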

Code Generation & Program Synthesis; Eval Frameworks & Benchmarks; Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
