DAMOUIUC Jun 1, 2026 arXiv:2606.02373

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Pengcheng Jiang, Zhiyi Shi, K. Hong, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han

AI Summary

This paper introduces Harness-1, a 20B-parameter search agent trained with reinforcement learning inside a stateful search harness. The harness keeps working memory on the environment side, so the policy only has to make the semantic search decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 reaches an average curated recall of 0.730, which is 11.4 points above the next strongest open search subagent and competitive with much larger frontier-model searchers. Gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state yields retrieval behavior that generalizes beyond the training domains.

Key Contribution

Harness-1 moves routine state management out of the policy and into the environment. Trained this way, it outperforms the next strongest open search subagent by 11.4 points in average curated recall across eight retrieval benchmarks.

Abstract

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.
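The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SearchHarness`, its methods, the character-based budget, and the three importance tags are all hypothetical, chosen only to show how an environment could own the candidate pool, the curated set, verification records, deduplication, and budget-aware rendering while the policy issues only semantic actions (keep, discard, verify).

```python
from dataclasses import dataclass, field

# Hypothetical importance tags, ordered from most to least important.
IMPORTANCE_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class SearchHarness:
    """Illustrative environment-side working memory (all names are assumptions)."""
    context_budget: int = 200                       # max characters rendered per turn
    candidates: dict = field(default_factory=dict)  # doc_id -> compressed text
    curated: dict = field(default_factory=dict)     # doc_id -> importance tag
    verified: dict = field(default_factory=dict)    # claim -> supporting doc_id

    def observe(self, results):
        """Add (doc_id, text) search results to the pool; return only unseen ids."""
        new_ids = []
        for doc_id, text in results:
            if doc_id in self.candidates:
                continue  # deduplicate repeated observations
            # Crude "compression": collapse whitespace and truncate.
            self.candidates[doc_id] = " ".join(text.split())[:80]
            new_ids.append(doc_id)
        return new_ids

    def keep(self, doc_id, importance="medium"):
        """Policy decision: move a candidate into the curated set with a tag."""
        if doc_id not in self.candidates:
            raise KeyError(f"unknown document: {doc_id}")
        if importance not in IMPORTANCE_ORDER:
            raise ValueError(f"unknown importance tag: {importance}")
        self.curated[doc_id] = importance

    def discard(self, doc_id):
        """Policy decision: drop a document from the curated set."""
        self.curated.pop(doc_id, None)

    def verify(self, claim, doc_id):
        """Policy decision: record that a curated document supports a claim."""
        if doc_id not in self.curated:
            raise KeyError(f"document not curated: {doc_id}")
        self.verified[claim] = doc_id

    def render(self):
        """Render the curated set, most important first, within the budget."""
        ranked = sorted(self.curated.items(), key=lambda kv: IMPORTANCE_ORDER[kv[1]])
        lines, used = [], 0
        for doc_id, tag in ranked:
            line = f"[{tag}] {doc_id}: {self.candidates[doc_id]}"
            if used + len(line) > self.context_budget:
                break  # stop once the next line would exceed the budget
            lines.append(line)
            used += len(line)
        return "\n".join(lines)
```

In this sketch the policy never has to re-read the full transcript: each turn it would see only the output of `render()` plus the newly observed document ids, and the bookkeeping stays in the harness.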

Recommendation & Information Retrieval, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
