The paper introduces MLGym and MLGym-Bench, a novel framework and benchmark for evaluating and training LLM agents on diverse AI research tasks spanning computer vision, NLP, RL, and game theory. It provides a Gym environment tailored for ML tasks, enabling RL-based training of agents capable of generating hypotheses, processing data, implementing ML methods, and iterating on experiments. Evaluations of frontier LLMs like Claude-3.5-Sonnet and GPT-4o on MLGym-Bench reveal that while they can improve on baselines through hyperparameter tuning, they struggle with generating truly novel research contributions.
LLMs can now play at being AI researchers, but they're mostly just good at hyperparameter sweeps, not groundbreaking discoveries.
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmark, including Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.
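To make the "Gym environment for ML tasks" framing concrete, here is a minimal, self-contained sketch of the standard Gym-style reset/step loop applied to a toy hyperparameter-tuning task. MLGym's actual task API is not shown in the abstract, so the environment, observation fields, and reward here are purely illustrative assumptions, not the framework's real interface.

```python
# Hypothetical sketch of a Gym-style interaction loop for an ML research task.
# ToyResearchEnv and its observation/reward structure are illustrative only;
# MLGym's real environments and action space may differ substantially.

class ToyResearchEnv:
    """Toy task: the agent proposes a learning rate; reward grows as it
    approaches a hidden optimum (a stand-in for validation accuracy)."""

    def __init__(self, optimum=0.01, max_steps=5):
        self.optimum = optimum
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        # Observation: a description of the task plus the last result, if any.
        return {"task": "tune learning rate", "last_score": None}

    def step(self, action):
        # action: a candidate hyperparameter value proposed by the agent.
        self.steps += 1
        score = 1.0 / (1.0 + abs(action - self.optimum))  # pseudo "val accuracy"
        observation = {"task": "tune learning rate", "last_score": score}
        done = self.steps >= self.max_steps
        return observation, score, done, {}


env = ToyResearchEnv()
obs = env.reset()
done = False
candidate = 0.1
while not done:
    obs, reward, done, info = env.step(candidate)
    candidate /= 2  # naive "agent" policy: halve the learning rate each step
```

In this framing, an RL algorithm would replace the naive halving policy with an LLM agent whose actions (editing code, launching training runs, analyzing results) are rewarded by task performance, which is what the paper means by enabling RL research on AI research tasks.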