AI2 | TCS Research | Feb 16, 2026 | arXiv:2602.15112

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan

AI Summary

The paper introduces ResearchGym, a benchmark of five containerized task environments built from oral and spotlight papers at ICML, ICLR, and ACL. Each environment withholds the paper's proposed method, so agents must propose their own hypotheses, run experiments, and try to surpass the human baselines. The authors evaluate a GPT-5-powered agent and the proprietary scaffolds Claude Code (Opus-4.5) and Codex (GPT-5.2), finding a sharp gap between capability and reliability: the GPT-5 agent beat the provided baselines in only 1 of 15 evaluations and completed 26.5% of sub-tasks on average. Although one run surpassed the solution of an ICML 2025 Spotlight task, the agents showed recurring failure modes such as impatience, poor time and resource management, overconfidence in weak hypotheses, and context-length limits.

Key Contribution

Even a GPT-5-powered agent rarely surpasses strong human baselines on end-to-end research tasks, exposing a sharp gap between capability and reliability.

Abstract

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap. The agent improves over the baselines provided in the repository in just 1 of 15 evaluations (6.7%), by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
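The headline rate in the abstract is simple arithmetic, and a minimal Python sketch makes it explicit. The function name `success_rate` is hypothetical and is not part of ResearchGym; only the counts (1 of 15) come from the abstract.

```python
def success_rate(successes: int, total: int) -> float:
    """Return the percentage of evaluations that beat the baseline."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


# Counts from the abstract: 1 improvement over the baselines in 15 evaluations.
rate = success_rate(1, 15)
print(f"{rate:.1f}%")  # prints "6.7%"
```

The same helper applies to any success count over a fixed number of runs, which makes it easier to compare reliability across agents evaluated under the same protocol.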

Eval Frameworks & Benchmarks | Scientific Discovery & Drug Design | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
