Freiburg | Feb 16, 2026 | arXiv:2602.14594

The Wikidata Query Logs Dataset

AI Summary

The paper introduces the Wikidata Query Logs (WDQL) dataset, a collection of 200k question-SPARQL query pairs derived from real-world Wikidata Query Service logs. It is over 6x larger than the largest existing Wikidata datasets of similar format. To create the dataset, the authors developed an agent-based method that de-anonymizes, cleans, and verifies logged SPARQL queries and generates a corresponding natural-language question for each. Experiments demonstrate the dataset's benefit for training question-answering models.

Key Contribution

WDQL offers a real-world alternative to template-generated datasets for Wikidata question answering that is over 6x larger than the largest existing dataset of similar format.

Abstract

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.
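The iterative repair-and-verify idea described in the abstract can be pictured roughly as the loop below. This is only a minimal sketch under stated assumptions, not the authors' agent: the names `repair_loop`, `propose_fix`, `run_query`, `write_question`, and `max_rounds` are all hypothetical, and the three callables stand in for whatever model calls and Wikidata Query Service requests a real system would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Pair:
    """One question-query pair (hypothetical container, not the WDQL schema)."""
    question: str
    sparql: str


def repair_loop(
    logged_query: str,
    propose_fix: Callable[[str, List[str]], str],
    run_query: Callable[[str], List[dict]],
    write_question: Callable[[str], str],
    max_rounds: int = 5,
) -> Optional[Pair]:
    """Revise a logged query until it returns rows, then attach a question.

    All three callables are assumptions made for illustration:
    propose_fix   -- returns a revised query given the current one and feedback
    run_query     -- executes a query and returns its result rows
    write_question -- produces a natural-language question for a query
    """
    query = logged_query
    feedback: List[str] = []
    for _ in range(max_rounds):
        try:
            rows = run_query(query)
        except Exception as exc:
            # Execution failed (for example, a syntax error): record it as feedback.
            feedback.append(f"error: {exc}")
        else:
            if rows:
                # Verified: the query produces results, so pair it with a question.
                return Pair(question=write_question(query), sparql=query)
            feedback.append("empty result")
        # Ask for a revised query based on everything observed so far.
        query = propose_fix(query, feedback)
    # Give up after max_rounds unsuccessful attempts.
    return None
```

Passing the executor and the fixer in as callables keeps the sketch runnable offline with stubs; a query that never yields results is dropped rather than paired with a question.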

Data Curation & Synthetic Data | Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
