The authors introduce MLR-Bench, a benchmark for evaluating AI agents on open-ended machine learning research tasks sourced from NeurIPS, ICLR, and ICML workshops. The benchmark includes an automated evaluation framework (MLR-Judge) built on LLM-based reviewers and a modular agent scaffold (MLR-Agent) for completing research tasks. Experiments on six frontier LLMs and an advanced coding agent reveal that while LLMs excel at idea generation and paper writing, coding agents often produce fabricated or invalid experimental results, highlighting a critical challenge for reliable AI-driven scientific discovery.
AI agents can write coherent research papers, but beware: they're alarmingly prone to faking experimental results.
Recent advances in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops, covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold that completes research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (in roughly 80% of cases) produce fabricated or invalid experimental results, posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers and supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
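To make the scaffold concrete, the sketch below shows one way such a four-stage pipeline and an LLM-judge rubric review could be wired together. It is a minimal illustration under stated assumptions, not the open-sourced MLR-Bench implementation: the `query_llm` helper, the stage prompts, and the rubric dimensions are all hypothetical names chosen for exposition.

```python
# Hypothetical sketch of a four-stage research-agent pipeline with an
# LLM-judge review, loosely following the MLR-Bench description above.
# query_llm, the stage prompts, and RUBRIC are illustrative assumptions,
# not the released MLR-Bench API.
from dataclasses import dataclass


@dataclass
class ResearchArtifact:
    task: str                 # workshop task description
    idea: str = ""            # stage 1: idea generation
    proposal: str = ""        # stage 2: proposal formulation
    experiment_log: str = ""  # stage 3: experimentation (code + results)
    paper: str = ""           # stage 4: paper writing


def query_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM backend (assumption)."""
    raise NotImplementedError


def run_pipeline(task: str) -> ResearchArtifact:
    """Run the four stages sequentially, each consuming earlier outputs."""
    art = ResearchArtifact(task=task)
    art.idea = query_llm(f"Propose a research idea for this task:\n{task}")
    art.proposal = query_llm(f"Write a detailed proposal for:\n{art.idea}")
    art.experiment_log = query_llm(f"Design and report experiments for:\n{art.proposal}")
    art.paper = query_llm(f"Write a paper from:\n{art.proposal}\n{art.experiment_log}")
    return art


# Rubric dimensions are an assumption for illustration only.
RUBRIC = ["clarity", "novelty", "soundness", "significance"]


def judge(art: ResearchArtifact) -> dict[str, int]:
    """Score the final paper on each rubric dimension (1-10) via an LLM reviewer."""
    scores = {}
    for dim in RUBRIC:
        reply = query_llm(
            f"On a 1-10 scale, rate the {dim} of this paper. "
            f"Reply with a number only.\n{art.paper}"
        )
        scores[dim] = int(reply.strip())
    return scores
```

A real scaffold would also give the experimentation stage tool use and code execution rather than a single prompt, and per the abstract MLR-Judge supports reviewing intermediate stages as well as the final paper, which this sketch omits.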