Mar 11, 2026 arXiv:2603.10876

An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Jennifer D'Souza, Sameer Sadruddin, Maximilian Kahler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, L. Snidaro, Osma Suominen

AI Summary

This paper introduces a large bilingual (English/German) dataset of catalog records annotated with the Integrated Authority File (GND) to support extreme multi-label text classification (XMTC). The dataset comes with a machine-actionable GND taxonomy, which enables ontology-aware classification and the mapping of text to authority terms. The authors also provide a brief statistical profile and qualitative error analyses of three systems, and invite the community to assess usefulness and transparency, not only accuracy, when developing AI co-pilots for cataloging.

Key Contribution

A large bilingual (English/German) corpus of GND-annotated catalog records, released with a machine-actionable GND taxonomy, for building and evaluating AI-assisted cataloging.

Abstract

Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.

Data Curation & Synthetic Data, Natural Language Processing, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 56

Year: 2026

Venue: N/A
