Mar 7, 2026 | arXiv:2603.07300

AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery

Nilesh Jain, Rohit Yadav, Sagar Kotian, AI Claude, Claude AI

AI Summary

AutoResearch-RL uses a PPO agent to autonomously modify the neural architecture and hyperparameters in a training script, with a reward derived from validation bits-per-byte (val-bpb), a metric where lower is better. The framework separates the frozen environment, the mutable target training script, and the meta-learner so that experiments are compared fairly and the agent can learn from accumulated experiment outcomes. On a single-GPU nanochat pretraining benchmark, AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 iterations, suggesting that neural architecture discovery can proceed without human supervision.

Key Contribution

AutoResearch-RL shows that an RL agent, working without human supervision, can discover neural architectures and hyperparameters that match or exceed hand-tuned baselines on a single-GPU nanochat pretraining benchmark.

Abstract

We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.
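The per-step signal described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `ExperimentResult` record, the reward shaping (improvement over the best val-bpb seen so far), and the `failure_penalty` value are all assumptions made for illustration, since the abstract only says the reward is "derived from" val-bpb. The clipped surrogate is the standard per-sample PPO objective, with `clip_eps = 0.2` as an assumed default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentResult:
    """Outcome of one time-boxed training run (hypothetical record type)."""
    val_bpb: float   # validation bits-per-byte; lower is better
    completed: bool  # False if the run crashed or exceeded its time budget


def reward_from_bpb(best_bpb: float, result: ExperimentResult,
                    failure_penalty: float = -1.0) -> float:
    """Assumed reward shaping: improvement over the best val-bpb so far.

    A positive reward means the proposed edit lowered val-bpb.
    Failed runs receive a fixed penalty (an assumption, not from the paper).
    """
    if not result.completed:
        return failure_penalty
    return best_bpb - result.val_bpb


def ppo_clipped_objective(ratio: float, advantage: float,
                          clip_eps: float = 0.2) -> float:
    """Standard per-sample PPO clipped surrogate: min(r * A, clip(r) * A).

    `ratio` is pi_new(a|s) / pi_old(a|s); clipping it to
    [1 - clip_eps, 1 + clip_eps] limits how far one update can move the policy.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

In this sketch, an edit that lowers val-bpb from 1.00 to 0.95 yields a positive reward, and the clipped objective caps how much credit a large policy ratio can claim for it, so no single lucky experiment dominates an update.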

Architecture Design (Transformers, SSMs, MoE) | Tool Use & Agents | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
