This paper introduces a modular evaluation framework and the Multi-Scenario Government Affairs Benchmark (MSGABench) dataset for assessing large language models (LLMs) in the government affairs domain. The work addresses the lack of standardized evaluation criteria and the concerns about accuracy, reliability, and security that hinder LLM adoption in government settings. An empirical evaluation of 15 LLMs using the framework and dataset revealed performance limitations, security vulnerabilities, and task-avoidance behaviors, highlighting areas for improvement in LLM deployment within the government sector.
LLMs struggle with basic accuracy, reliability, and security in government affairs tasks, with some models failing under minor input variations and others exhibiting task avoidance.
The rapid evolution of AI has driven advancements across numerous sectors. In the domain of government affairs, large language models (LLMs) hold significant potential for applications such as policy analysis, data processing, and decision support. However, their adoption in government settings faces considerable challenges, including data accessibility issues, the absence of standardized evaluation criteria, and concerns regarding model accuracy, reliability, and security. To address these challenges, we propose a comprehensive evaluation framework specifically designed for LLMs in government affairs. Built on modular principles, this framework ensures adaptability across various industries. Additionally, we introduce the Multi-Scenario Government Affairs Benchmark (MSGABench) dataset, a Chinese-language dataset crafted to meet the practical needs of government professionals. Employing the proposed framework and the MSGABench dataset, we conducted an empirical evaluation of 15 prominent LLMs, revealing critical insights: (1) Performance: Many models demonstrated low accuracy and reliability, particularly under minor input variations, with some dropping below 35% accuracy, whereas GPT-4 achieved above 95% reliability; (2) Security and Compliance: Significant concerns were identified, including privacy vulnerabilities, legal compliance risks, and persistent biases, which may hinder secure deployment in government contexts; (3) Task Avoidance: Certain models exhibited excessive caution, often declining to respond to basic tasks such as document classification and government-related inquiries, which restricts their usability. These findings highlight essential limitations and opportunities for improvement, contributing to the safe and effective application of LLMs in the government sector.
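The abstract does not detail the evaluation protocol, but as a rough illustration of the kind of reliability check it alludes to (consistency under minor input variations), the sketch below scores how often a model's answer survives light rewording of a prompt. Everything in it, including the query_model callable, the paraphrase list, and the exact-match scoring, is a hypothetical stand-in and not the MSGABench framework's actual method.

```python
# A minimal sketch (assumed, not the paper's released code): scoring how
# consistently a model answers when a government-affairs prompt is lightly
# reworded, in the spirit of the "minor input variation" reliability check
# mentioned in the abstract. `query_model` is a hypothetical stand-in for a
# real LLM API call.

def consistency_score(query_model, prompt_variants, reference_answer):
    """Fraction of reworded prompts for which the model still returns the
    reference answer (exact match after whitespace stripping)."""
    answers = [query_model(p).strip() for p in prompt_variants]
    return sum(a == reference_answer for a in answers) / len(prompt_variants)

if __name__ == "__main__":
    # Stubbed model: labels the document "Category B" only when the prompt
    # literally asks it to "classify" -- a toy model that fails on rewording.
    def query_model(prompt):
        return "Category B" if "classif" in prompt.lower() else "no answer"

    variants = [
        "Please classify this government document: ...",
        "classify the following document: ...",
        "What category does this document fall under? ...",
    ]
    print(consistency_score(query_model, variants, "Category B"))  # 0.666...
```

A real harness would substitute an actual LLM client for the stub and use a more forgiving match (e.g., normalized or judge-based scoring), but the shape of the check is the same: the same underlying question, several surface forms, one consistency score.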