This paper introduces GeoBenchX, a benchmark for evaluating LLMs' tool-calling abilities on multi-step geospatial tasks, using a tool-calling agent equipped with 23 geospatial functions. Eight commercial LLMs were tested across four task complexity levels, including both solvable and unsolvable tasks, and evaluated with an LLM-as-Judge framework. The results indicated that o4-mini and Claude 3.5 Sonnet performed best overall; GPT-4.1, GPT-4o, and Gemini 2.5 Pro Preview were close behind, with GPT-4o and Gemini 2.5 Pro Preview better at rejecting unsolvable tasks; and Anthropic models consumed more tokens than competitors.
Turns out, Claude 3.5 Sonnet and o4-mini are surprisingly good at geospatial tasks, outperforming even GPT-4.1 and Gemini 2.5 Pro Preview on a new benchmark for tool-calling LLMs.
This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks in four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test rejection accuracy. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference solutions. Results show o4-mini and Claude 3.5 Sonnet achieve the best overall performance; OpenAI's GPT-4.1, GPT-4o and Google's Gemini 2.5 Pro Preview do not fall far behind, and the last two are better at identifying unsolvable tasks. Claude Sonnet 4, due to its preference for providing some solution rather than rejecting a task, proved less accurate. We observe significant differences in token usage, with Anthropic models consuming more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources, providing a standardized method for the ongoing evaluation of LLMs for GeoAI.
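To make the evaluation setup concrete, here is a minimal sketch of how an agent's tool-call sequence might be compared against a reference solution. This is an illustrative assumption only: the class names, the `judge` function, and the exact/subset-matching verdict scheme are hypothetical stand-ins, since the paper's actual LLM-as-Judge framework prompts a model with both solutions rather than matching calls programmatically.

```python
# Hypothetical sketch of comparing an agent's geospatial tool calls
# against a reference solution. All names here are illustrative, not
# taken from the GeoBenchX codebase.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str    # e.g. "buffer", "spatial_join"
    args: tuple  # simplified positional arguments


def judge(agent_calls: list[ToolCall], reference: list[ToolCall]) -> str:
    """Return a coarse verdict: 'correct', 'partial', or 'incorrect'.

    A real LLM-as-Judge would ask a model to grade the two solutions;
    here exact/subset matching stands in for that call, for illustration.
    """
    if agent_calls == reference:
        return "correct"
    matched = sum(1 for call in agent_calls if call in reference)
    if matched >= len(reference) / 2:
        return "partial"
    return "incorrect"


# Example: the agent reproduces 2 of 3 reference steps.
reference = [
    ToolCall("load_layer", ("rivers",)),
    ToolCall("buffer", ("rivers", 500)),
    ToolCall("spatial_join", ("buffered_rivers", "cities")),
]
agent = reference[:2]
print(judge(agent, reference))  # -> partial
```

In the paper's actual setup, the judge model also sees the task text, which lets it accept alternative-but-valid tool sequences that simple matching like this would miss.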