
Anthropic
AI safety company building reliable, interpretable, and steerable AI systems. Creator of Claude.
www.anthropic.com
Recent Papers
GeoBenchX: a benchmark and evaluation framework for assessing LLMs' tool-calling capabilities on complex, multi-step geospatial tasks.

This paper introduces GeoBenchX, which pairs a tool-calling agent equipped with 23 geospatial functions with tasks spanning four complexity levels, including both solvable and deliberately unsolvable tasks. Eight commercial LLMs were tested and scored using an LLM-as-Judge framework. o4-mini and Claude 3.5 Sonnet performed best overall; GPT-4.1, GPT-4o, and Gemini 2.5 Pro Preview were competitive and more accurately rejected unsolvable tasks; and the Anthropic models consumed more tokens.
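The summary describes a tool-calling agent that drives geospatial functions until the model stops requesting tools. As a minimal sketch of such a loop using the Anthropic Messages API: the `buffer_layer` tool, its schema, and the `run_tool`/`solve` helpers below are invented for illustration and are not GeoBenchX's actual 23-function harness.

```python
# Hypothetical sketch of a tool-calling agent loop of the kind GeoBenchX
# evaluates; tool names and schemas here are invented for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One stand-in for the benchmark's geospatial tools, declared as a JSON schema.
TOOLS = [
    {
        "name": "buffer_layer",
        "description": "Buffer all geometries in a named layer by a distance in km.",
        "input_schema": {
            "type": "object",
            "properties": {
                "layer": {"type": "string"},
                "distance_km": {"type": "number"},
            },
            "required": ["layer", "distance_km"],
        },
    }
]

def run_tool(name: str, args: dict) -> str:
    # Placeholder dispatch; a real agent would invoke GIS code here.
    return f"{name} executed with {args}"

def solve(task: str, model: str = "claude-3-5-sonnet-latest") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model=model, max_tokens=1024, tools=TOOLS, messages=messages
        )
        if response.stop_reason != "tool_use":
            # No further tool calls: return the model's final answer text.
            return "".join(b.text for b in response.content if b.type == "text")
        # Record the assistant turn, then feed back one result per tool call.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

A multi-step task would then be posed as, e.g., `solve("Buffer the rivers layer by 5 km and report the result")`, with the loop iterating once per tool call until the model produces a final text answer.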

