The paper introduces a new geospatial audio tagging (Geo-AT) task, which conditions multi-label sound event tagging on geospatial semantic context (GSC) alongside audio, to address limitations of audio-only recognition in CASA. To benchmark this task, the authors present Geo-ATBench, a polyphonic audio dataset with geographical annotations, and propose GeoFusion-AT, a geo-audio fusion framework for evaluating different fusion strategies. Experiments demonstrate that incorporating GSC improves audio tagging performance, particularly for acoustically similar labels, and a listening study confirms that Geo-ATBench aligns with human perception.
Geospatial context is a surprisingly effective prior for audio tagging, especially when sounds are acoustically similar, leading to improved performance over audio-only methods.
Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustically similar events can be difficult to separate from waveforms alone, and the disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data such as points of interest (POI), provides location-tied environmental priors that can reduce this ambiguity. To enable a systematic study of this direction, the geospatial audio tagging (Geo-AT) task is proposed, which conditions multi-label sound event tagging on GSC alongside audio. To benchmark Geo-AT, Geo-ATBench is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of audio across 28 event categories; each clip is paired with a GSC representation drawn from 11 semantic context categories. GeoFusion-AT is proposed as a unified geo-audio fusion framework that evaluates feature-, representation-, and decision-level fusion on representative audio backbones, with audio-only and GSC-only baselines. Results show that incorporating GSC improves AT performance, especially on acoustically confounded labels, indicating that geospatial semantics provide effective priors beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples finds no significant difference in model performance when evaluated against Geo-ATBench labels versus aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. The Geo-AT task, the Geo-ATBench benchmark, and the reproducible geo-audio fusion framework GeoFusion-AT provide a foundation for studying AT with geospatial semantic context within the CASA community. The dataset, code, and models are available at the project homepage (https://github.com/WuYanru2002/Geo-ATBench).
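The fusion levels named above can be illustrated with a minimal NumPy sketch. This is not the paper's GeoFusion-AT implementation; the feature dimensions, the classifier weights, and the GSC encoding (here a toy vector over the 11 semantic context categories) are all hypothetical placeholders, shown only to contrast feature-level fusion (concatenating modalities before one classifier) with decision-level fusion (combining per-label probabilities from separate heads).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions; the actual backbones and GSC encoders differ.
N_EVENTS = 28   # sound event categories in Geo-ATBench
N_GSC = 11      # semantic context categories
AUDIO_DIM = 128 # assumed audio embedding size (placeholder)

def feature_level_fusion(audio_feat, gsc_feat, w):
    """Concatenate audio and GSC features, then apply one shared multi-label head."""
    fused = np.concatenate([audio_feat, gsc_feat])
    return sigmoid(w @ fused)  # per-label probabilities in [0, 1]

def decision_level_fusion(audio_logits, gsc_logits, alpha=0.5):
    """Blend per-label probabilities from an audio-only and a GSC-only head."""
    return alpha * sigmoid(audio_logits) + (1.0 - alpha) * sigmoid(gsc_logits)

rng = np.random.default_rng(0)
audio_feat = rng.standard_normal(AUDIO_DIM)      # stand-in for a backbone embedding
gsc_feat = rng.standard_normal(N_GSC)            # stand-in for a POI-derived GSC vector
w = rng.standard_normal((N_EVENTS, AUDIO_DIM + N_GSC))

p_feat = feature_level_fusion(audio_feat, gsc_feat, w)
p_dec = decision_level_fusion(rng.standard_normal(N_EVENTS),
                              rng.standard_normal(N_EVENTS))
```

Representation-level fusion sits between the two: each modality is first encoded separately, and the learned representations (rather than raw features or final decisions) are merged before classification.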