Feb 3, 2026 · arXiv:2602.04064

The CitizenQuery Benchmark: A Novel Dataset and Evaluation Pipeline for Measuring LLM Performance in Citizen Query Tasks

Neil Majithia, Rajat Shinde, Zo Chapman, Prajun Trital, Jordan Decker, M. Maskey, E. Simperl, Nigel Shadbolt

AI Summary

The paper introduces CitizenQuery-UK, a new benchmark dataset of 22,000 synthetically generated citizen queries and responses based on UK government information from gov.uk. The authors adapt the FActScore methodology to evaluate 11 LLMs on factuality, abstention, and verbosity in the context of citizen query answering. Results indicate distinct performance profiles across models, high variance, low abstention, and high verbosity, highlighting challenges in deploying LLMs for trustworthy public sector applications.

Key Contribution

LLMs answering citizen questions exhibit high performance variance and a tendency to over-explain while rarely admitting uncertainty, raising serious concerns about their trustworthiness in public sector applications.

Abstract

"Citizen queries" are questions asked by an individual about government policies, guidance, and services that are relevant to their circumstances, encompassing a range of topics including benefits, taxes, immigration, employment, public health, and more. This represents a compelling use case for Large Language Models (LLMs) that respond to citizen queries with information that is adapted to a user's context and communicated according to their needs. However, in this use case, any misinformation could have severe, negative, likely invisible ramifications for an individual placing their trust in a model's response. To this effect, we introduce CitizenQuery-UK, a benchmark dataset of 22 thousand pairs of citizen queries and responses that have been synthetically generated from the swathes of public information on gov.uk about government in the UK. We present the curation methodology behind CitizenQuery-UK and an overview of its contents. We also introduce a methodology for the benchmarking of LLMs with the dataset, using an adaptation of FActScore to benchmark 11 models for factuality, abstention frequency, and verbosity. We document these results, and interpret them in the context of the public sector, finding that: (i) there are distinct performance profiles across model families, but each is competitive; (ii) high variance undermines utility; (iii) abstention is low and verbosity is high, with implications for reliability; and (iv) more trustworthy AI requires acknowledged "fallibility" in the way it interacts with users. The contribution of our research lies in assessing the trustworthiness of LLMs in citizen query tasks; as we see a world of increasing AI integration into day-to-day life, our benchmark, built entirely on open data, lays the foundations for better-evidenced decision-making regarding AI and the public sector.
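To make the three reported measures concrete, the following is a minimal sketch of how FActScore-style metrics could be aggregated. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not describe the adaptation in detail. The sketch assumes each response has already been split into atomic facts and each fact judged supported or unsupported against the reference (that judging step is not shown), treats factuality as the mean per-response fraction of supported facts, and uses the number of atomic facts per answered response as a stand-in for verbosity. The names `ScoredResponse` and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class ScoredResponse:
    """One model response after (hypothetical) atomic-fact judging."""
    supported: int           # atomic facts judged supported by the reference
    total: int               # atomic facts extracted from the response
    abstained: bool = False  # True if the model declined to answer


def summarize(responses):
    """Aggregate factuality, its spread, abstention rate, and verbosity.

    Assumptions (not taken from the paper): abstentions are excluded from
    the factuality average, and verbosity is the mean atomic-fact count.
    """
    answered = [r for r in responses if not r.abstained and r.total > 0]
    per_response = [r.supported / r.total for r in answered]
    return {
        "factuality": mean(per_response) if per_response else None,
        "factuality_std": pstdev(per_response) if per_response else None,
        "abstention_rate": (
            sum(r.abstained for r in responses) / len(responses)
            if responses else None
        ),
        "verbosity": mean(r.total for r in answered) if answered else None,
    }


if __name__ == "__main__":
    demo = [
        ScoredResponse(supported=3, total=4),
        ScoredResponse(supported=5, total=5),
        ScoredResponse(supported=0, total=0, abstained=True),
        ScoredResponse(supported=1, total=2),
    ]
    print(summarize(demo))
```

Reporting the standard deviation alongside the mean matters here because the abstract identifies high variance, not just average factuality, as a barrier to utility.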

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
