
Anthropic
AI safety company building reliable, interpretable, and steerable AI systems. Creator of Claude.
www.anthropic.com
Recent Papers
GeoBenchX: a benchmark and evaluation framework for assessing LLMs' tool-calling capabilities on complex, multi-step geospatial tasks.

This paper introduces GeoBenchX, which pairs a tool-calling agent equipped with 23 geospatial functions with tasks spanning four complexity levels, including both solvable and deliberately unsolvable tasks. Eight commercial LLMs were tested and scored using an LLM-as-Judge framework. o4-mini and Claude 3.5 Sonnet performed best overall; GPT-4.1, GPT-4o, and Gemini 2.5 Pro Preview were competitive and more accurately rejected unsolvable tasks; and the Anthropic models consumed more tokens.
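The summary describes a tool-calling agent that drives geospatial functions until the model stops requesting tools. As a minimal sketch of such a loop using the Anthropic Messages API: the `buffer_layer` tool, its schema, and the `run_tool`/`solve` helpers below are invented for illustration and are not GeoBenchX's actual 23-function harness.

```python
# Hypothetical sketch of a tool-calling agent loop of the kind GeoBenchX
# evaluates; tool names and schemas here are invented for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One stand-in for the benchmark's geospatial tools, declared as a JSON schema.
TOOLS = [
    {
        "name": "buffer_layer",
        "description": "Buffer all geometries in a named layer by a distance in km.",
        "input_schema": {
            "type": "object",
            "properties": {
                "layer": {"type": "string"},
                "distance_km": {"type": "number"},
            },
            "required": ["layer", "distance_km"],
        },
    }
]

def run_tool(name: str, args: dict) -> str:
    # Placeholder dispatch; a real agent would invoke GIS code here.
    return f"{name} executed with {args}"

def solve(task: str, model: str = "claude-3-5-sonnet-latest") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model=model, max_tokens=1024, tools=TOOLS, messages=messages
        )
        if response.stop_reason != "tool_use":
            # No further tool calls: return the model's final answer text.
            return "".join(b.text for b in response.content if b.type == "text")
        # Record the assistant turn, then feed back one result per tool call.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

A multi-step task would then be posed as, e.g., `solve("Buffer the rivers layer by 5 km and report the result")`, with the loop iterating once per tool call until the model produces a final text answer.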

