Australian-Future-Hearing-Initiative | Feb 23, 2026 | arXiv:2602.19409

AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration

Henry Zhong, Jörg M. Buchholz, Julian Maclaren, Simon Carlile, Richard F. Lyon

AI Summary

The paper introduces AuditoryHuM, a framework that leverages Multimodal Large Language Models (MLLMs) to generate and cluster auditory scene labels in an unsupervised manner. It uses zero-shot learning (Human-CLAP) to align generated text labels with audio content and incorporates human-in-the-loop intervention to refine poorly aligned pairs. The discovered labels are then clustered using a modified silhouette score to balance cohesion and granularity, demonstrating a scalable and low-cost solution for creating standardized taxonomies across three auditory scene datasets.
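The targeted review step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes audio and text embeddings have already been produced by a CLAP-style model (Human-CLAP in the paper), it assumes plain cosine similarity as the alignment score, and the names `cosine`, `least_aligned`, and the review budget `k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def least_aligned(audio_embs, text_embs, k):
    """Return the indices of the k audio/label pairs with the lowest alignment.

    audio_embs[i] and text_embs[i] are assumed to be the embeddings of the
    i-th audio clip and of the label generated for it. The returned indices
    are ordered from worst to best alignment, so a human reviewer can start
    with the most suspicious labels.
    """
    scores = [cosine(a, t) for a, t in zip(audio_embs, text_embs)]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k], scores


if __name__ == "__main__":
    # Toy 2-D embeddings; real CLAP-style embeddings have many more dimensions.
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    to_review, scores = least_aligned(audio, text, k=1)
    print("pairs flagged for human review:", to_review)
```

Ranking by score and reviewing a fixed budget of `k` pairs keeps the human effort bounded regardless of dataset size; a score threshold would be an equally plausible design.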

Key Contribution

Rather than painstakingly labeling audio datasets by hand, AuditoryHuM uses MLLMs and targeted human input to generate and cluster high-quality auditory scene labels.

Abstract

Manual annotation of audio datasets is labour-intensive, and it is challenging to balance label granularity with acoustic separability. We introduce AuditoryHuM, a novel framework for the unsupervised discovery and clustering of auditory scene labels using a collaborative Human-Multimodal Large Language Model (MLLM) approach. By leveraging MLLMs (Gemma and Qwen), the framework generates contextually relevant labels for audio data. To ensure label quality and mitigate hallucinations, we employ zero-shot learning techniques (Human-CLAP) to quantify the alignment between generated text labels and raw audio content. A strategically targeted human-in-the-loop intervention is then used to refine the least aligned pairs. The discovered labels are grouped into thematically cohesive clusters using an adjusted silhouette score that incorporates a penalty parameter to balance cluster cohesion and thematic granularity. Evaluated across three diverse auditory scene datasets (ADVANCE, AHEAD-DS, and TAU 2019), AuditoryHuM provides a scalable, low-cost solution for creating standardised taxonomies. This solution facilitates the training of lightweight scene recognition models deployable to edge devices, such as hearing aids and smart home assistants. The project page and code are available at https://github.com/Australian-Future-Hearing-Initiative
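The idea of a penalised silhouette score can be illustrated with a short sketch. The abstract does not state the formula, so everything specific here is an assumption: the sketch subtracts a linear penalty, `penalty * number_of_clusters`, from the standard mean silhouette coefficient, uses Euclidean distance, and the names `mean_silhouette` and `adjusted_silhouette` are hypothetical. The paper's actual adjustment may take a different form.

```python
import math


def mean_silhouette(points, labels):
    """Standard mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention, a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def adjusted_silhouette(points, labels, penalty):
    """Assumed form: mean silhouette minus a linear penalty on cluster count."""
    return mean_silhouette(points, labels) - penalty * len(set(labels))


if __name__ == "__main__":
    # Toy label embeddings forming two well-separated groups.
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    coarse = [0, 0, 1, 1]  # two clusters
    fine = [0, 1, 2, 3]    # every label in its own cluster
    for name, labs in (("coarse", coarse), ("fine", fine)):
        print(name, round(adjusted_silhouette(pts, labs, penalty=0.1), 4))
```

Under this assumed form, a larger `penalty` favours fewer, broader clusters, while a penalty of zero reduces the score to the ordinary silhouette coefficient.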

Data Curation & Synthetic Data | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
