OpenClaw-RL is a novel RL framework that treats next-state signals (user replies, tool outputs, etc.) as a universal learning source for training agents across diverse tasks. It extracts evaluative signals as scalar rewards with a PRM judge, and recovers directive signals through Hindsight-Guided On-Policy Distillation (OPD), which turns textual hints from the next state into token-level directional advantage supervision. An asynchronous design lets live request serving, PRM judging, and policy updating run simultaneously, so agents improve continuously through interaction.
Forget fine-tuning on curated datasets: OpenClaw-RL lets agents learn directly and continuously from *every* interaction, turning user replies, tool outputs, and even GUI changes into valuable RL signals.
Every agent interaction produces a next-state signal: the user reply, tool output, or terminal/GUI state change that follows each action. Yet no existing agentic RL system recovers this signal as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a single policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems; they are all interactions that can train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and derive token-level directional advantage supervision that is richer than any scalar reward. Thanks to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy, all at the same time and with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
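To make the evaluative path concrete, here is a minimal sketch of scalar reward extraction with a PRM judge: a judge LM sees the action and the next-state signal it produced, and returns a score that is parsed into a reward. The prompt template, the `judge` callable, and the 0–10 scale are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch: extracting an evaluative scalar reward with a PRM
# judge. `judge` is any callable mapping a prompt string to generated text;
# the prompt wording and parsing below are illustrative assumptions.
import re

JUDGE_PROMPT = """You are a process reward judge. Given an agent's action and
the next-state signal it produced (a user reply, tool output, or GUI/terminal
change), rate how well the action performed on a scale from 0 to 10.

Action:
{action}

Next-state signal:
{next_state}

Answer with a single number."""

def evaluative_reward(judge, action: str, next_state: str) -> float:
    """Scalar reward in [0, 1] extracted from the next-state signal."""
    reply = judge(JUDGE_PROMPT.format(action=action, next_state=next_state))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    score = float(match.group()) if match else 5.0  # fall back to neutral
    return min(max(score / 10.0, 0.0), 1.0)
```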
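The directive path yields a per-token signal rather than a scalar. One plausible reading of hindsight-guided OPD, sketched below under stated assumptions (HF-style causal LMs `policy` and a frozen `teacher` sharing a tokenizer, and hint tokens already extracted from the next state), is that each sampled action token is scored by how much more likely the hint-conditioned teacher finds it than the student did, and that difference serves as a token-level directional advantage. This is a sketch of the idea, not the framework's implementation.

```python
# Minimal sketch of Hindsight-Guided On-Policy Distillation (OPD).
# Assumes HF-style causal LMs whose forward pass returns `.logits`.
import torch
import torch.nn.functional as F

def token_level_advantages(policy, teacher, context_ids, action_ids, hint_ids):
    """Per-token directional advantages for the sampled action tokens.

    The teacher sees the original context plus a textual hint extracted from
    the next state; each action token's advantage is how much more (or less)
    likely the hint-conditioned teacher finds it than the student.
    """
    # Student log-probs of the action tokens under the original context.
    student_in = torch.cat([context_ids, action_ids], dim=-1)
    student_logits = policy(student_in).logits[:, context_ids.size(-1) - 1 : -1]
    student_lp = F.log_softmax(student_logits, dim=-1).gather(
        -1, action_ids.unsqueeze(-1)).squeeze(-1)

    # Teacher log-probs of the same tokens under the hint-enhanced context.
    teacher_ctx = torch.cat([context_ids, hint_ids], dim=-1)
    teacher_in = torch.cat([teacher_ctx, action_ids], dim=-1)
    with torch.no_grad():
        teacher_logits = teacher(teacher_in).logits[:, teacher_ctx.size(-1) - 1 : -1]
    teacher_lp = F.log_softmax(teacher_logits, dim=-1).gather(
        -1, action_ids.unsqueeze(-1)).squeeze(-1)

    # Positive where the hint says "more of this token", negative otherwise.
    return (teacher_lp - student_lp).detach(), student_lp

def opd_loss(policy, teacher, context_ids, action_ids, hint_ids):
    adv, student_lp = token_level_advantages(
        policy, teacher, context_ids, action_ids, hint_ids)
    # Policy-gradient-style update weighted by the token-level advantages.
    return -(adv * student_lp).mean()
```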
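Finally, the asynchronous design can be pictured as three workers that share data only through queues, so serving, judging, and training never block one another. The sketch below is self-contained but stubbed: the `Transition` fields, the random reward, and the placeholder hint stand in for the framework's real components.

```python
# A minimal, self-contained sketch of the asynchronous serve/judge/train
# loop. Each worker communicates only via queues; none waits on another.
import queue
import random
import threading
import time
from dataclasses import dataclass

@dataclass
class Transition:
    context: str
    action: str
    next_state: str
    reward: float = 0.0
    hint: str = ""

interactions: "queue.Queue[Transition]" = queue.Queue()  # actor -> judge
experience: "queue.Queue[Transition]" = queue.Queue()    # judge -> trainer

def serve_loop():
    """Handle live requests; every interaction yields a next-state signal."""
    while True:
        t = Transition(context="user query", action="agent action",
                       next_state="user reply / tool output")
        interactions.put(t)
        time.sleep(0.01)

def judge_loop():
    """Attach an evaluative reward and a directive hint to each transition."""
    while True:
        t = interactions.get()
        t.reward = random.random()   # stand-in for the PRM judge
        t.hint = "do X instead"      # stand-in for next-state hint extraction
        experience.put(t)

def train_loop():
    """Consume judged transitions and update the policy in batches."""
    while True:
        batch = [experience.get() for _ in range(8)]
        # update_policy(batch) would apply the PRM reward + OPD losses here
        print(f"updated policy on {len(batch)} transitions")

for loop in (serve_loop, judge_loop, train_loop):
    threading.Thread(target=loop, daemon=True).start()
time.sleep(0.1)  # let the workers run briefly in this demo
```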