This paper introduces LoHoSearch, a benchmark for evaluating long-horizon search agents. It comprises 544 human-verified questions across 11 domains, generated by an automated pipeline built on a knowledge graph of over 7 million Wikipedia entities. The authors argue that existing benchmarks have hit a difficulty ceiling because they are predominantly human-authored: annotators lack a global view of entity statistics and so cannot systematically maximize search space size or structural complexity. Even the strongest model evaluated achieves only 34.74% accuracy on the new benchmark, which the authors present as a more demanding standard for long-horizon reasoning and context management.
Even the strongest search agent falls short of 35% accuracy on a benchmark designed to push the limits of long-horizon reasoning.
Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
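The abstract describes the construction pipeline only at a high level, so the following is a minimal sketch of the two ideas it names: measuring the search space of a relation and checking that a composed question has a KG-verified unique answer. It is not the authors' implementation. The toy triples, the entity names, and the helpers `build_index`, `search_space_size`, `candidate_answers`, and `has_unique_answer` are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, relation, object) triples.
# All entities and relations are hypothetical placeholders.
TRIPLES = [
    ("film_a", "directed_by", "director_x"),
    ("film_b", "directed_by", "director_x"),
    ("film_c", "directed_by", "director_y"),
    ("film_a", "released_in", "1999"),
    ("film_b", "released_in", "2004"),
    ("film_c", "released_in", "1999"),
]


def build_index(triples):
    """Map each (relation, object) constraint to the set of matching subjects."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def search_space_size(index, constraint):
    """Number of entities an agent would have to sift through for one constraint."""
    return len(index.get(constraint, set()))


def candidate_answers(index, constraints):
    """Entities that satisfy every constraint of a composed question."""
    matches = [index.get(c, set()) for c in constraints]
    return set.intersection(*matches) if matches else set()


def has_unique_answer(index, constraints):
    """A question is kept only if the graph yields exactly one answer."""
    return len(candidate_answers(index, constraints)) == 1


INDEX = build_index(TRIPLES)
# Assumed example question: "Which film was directed by director_x and released in 1999?"
QUESTION = [("directed_by", "director_x"), ("released_in", "1999")]

if __name__ == "__main__":
    for constraint in QUESTION:
        print(constraint, "search space:", search_space_size(INDEX, constraint))
    print("answers:", candidate_answers(INDEX, QUESTION))
    print("unique:", has_unique_answer(INDEX, QUESTION))
```

In this sketch each constraint alone matches two films, while their conjunction matches exactly one, so the composed question passes the uniqueness check. A pipeline along these lines would presumably prefer constraints whose individual search spaces are large while still requiring a single answer for the conjunction.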