CMU ML, Chung-Ang, KAIST, NAVER Labs, SNU | Jun 1, 2026 | arXiv:2606.02404

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim

AI Summary

This paper introduces K-BrowseComp, a benchmark that evaluates web-browsing agents on tasks grounded in Korean contexts. It comprises 400 problems: a 300-problem subset manually constructed and validated by native Korean speakers (K-BrowseComp-Verified) and a 100-problem synthetic diagnostic split. Frontier LLMs, including GPT-5.5 and DeepSeek-V4-Pro, reach only 30.00-45.67% accuracy on the verified subset, a substantial drop from the original BrowseComp. On the adversarially filtered synthetic split, the strongest model reaches only 26.00%, which underlines how difficult Korean web content remains for current models.

Key Contribution

Leading LLMs falter on Korean web-browsing tasks, solving fewer than half of the verified problems (30.00-45.67%) and scoring substantially below their results on the original BrowseComp.

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00-45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00-10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
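To make the reported percentages concrete, the sketch below converts them into approximate counts of solved problems. It assumes each problem is scored as simply correct or incorrect and that each accuracy comes from a single pass over the split; the abstract states neither, and the helper names `implied_correct` and `accuracy_pct` are hypothetical, introduced only for illustration.

```python
# Illustrative arithmetic only: assumes binary per-problem scoring and a
# single evaluation pass, neither of which is stated in the abstract.

VERIFIED_SIZE = 300   # K-BrowseComp-Verified
SYNTHETIC_SIZE = 100  # synthetic diagnostic split


def implied_correct(accuracy: float, n_problems: int) -> int:
    """Nearest whole number of correct answers consistent with an accuracy in percent."""
    return round(accuracy / 100 * n_problems)


def accuracy_pct(n_correct: int, n_problems: int) -> float:
    """Accuracy in percent, rounded to two decimals as in the abstract."""
    return round(100 * n_correct / n_problems, 2)


# Percentages exactly as reported in the abstract.
reported = {
    "frontier LLMs, lower bound (Verified)": (30.00, VERIFIED_SIZE),
    "frontier LLMs, upper bound (Verified)": (45.67, VERIFIED_SIZE),
    "Korean LLMs, lower bound (Verified)": (0.00, VERIFIED_SIZE),
    "Korean LLMs, upper bound (Verified)": (10.33, VERIFIED_SIZE),
    "strongest model (synthetic split)": (26.00, SYNTHETIC_SIZE),
}

for label, (pct, n) in reported.items():
    k = implied_correct(pct, n)
    print(f"{label}: {pct:.2f}% of {n} is about {k} correct "
          f"(recomputed: {accuracy_pct(k, n):.2f}%)")
```

Under these assumptions, the frontier range would correspond to roughly 90 to 137 of the 300 verified problems, and the best Korean LLM to roughly 31.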

Eval Frameworks & Benchmarks | Natural Language Processing | Tool Use & Agents


Year: 2026

Venue: N/A
