This paper introduces a modular evaluation framework and the Multi-Scenario Government Affairs Benchmark (MSGABench) dataset for assessing large language models (LLMs) in the government affairs domain. The work addresses the lack of standardized evaluation criteria and the concerns about accuracy, reliability, and security that hinder LLM adoption in government settings. An empirical evaluation of 15 LLMs using the framework and dataset revealed performance limitations, security vulnerabilities, and task-avoidance behaviors, highlighting areas for improvement in LLM deployment within the government sector.
LLMs struggle with basic accuracy, reliability, and security in government affairs tasks, with some models failing under minor input variations and others exhibiting task avoidance.
The rapid evolution of AI has driven advancements across numerous sectors. In the domain of government affairs, large language models (LLMs) hold significant potential for applications such as policy analysis, data processing, and decision support. However, their adoption in government settings faces considerable challenges, including data accessibility issues, the absence of standardized evaluation criteria, and concerns regarding model accuracy, reliability, and security. To address these challenges, we propose a comprehensive evaluation framework specifically designed for LLMs in government affairs. Built on modular principles, this framework ensures adaptability across various industries. Additionally, we introduce the Multi-Scenario Government Affairs Benchmark (MSGABench) dataset, a Chinese-language dataset crafted to meet the practical needs of government professionals. Employing the proposed framework and the MSGABench dataset, we conducted an empirical evaluation of 15 prominent LLMs, revealing critical insights: (1) Performance: Many models demonstrated low accuracy and reliability, particularly under minor input variations, with some dropping below 35% accuracy, whereas GPT-4 achieved above 95% reliability; (2) Security and Compliance: Significant concerns were identified, including privacy vulnerabilities, legal compliance risks, and persistent biases, which may hinder secure deployment in government contexts; (3) Task Avoidance: Certain models exhibited excessive caution, often declining to respond to basic tasks such as document classification and government-related inquiries, which restricts their usability. These findings highlight essential limitations and opportunities for improvement, contributing to the safe and effective application of LLMs in the government sector.
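The abstract does not detail the evaluation protocol, but as a rough illustration of the kind of reliability check it alludes to (consistency under minor input variations), the sketch below scores how often a model's answer survives light rewording of a prompt. Everything in it, including the query_model callable, the paraphrase list, and the exact-match scoring, is a hypothetical stand-in and not the MSGABench framework's actual method.

```python
# A minimal sketch (assumed, not the paper's released code): scoring how
# consistently a model answers when a government-affairs prompt is lightly
# reworded, in the spirit of the "minor input variation" reliability check
# mentioned in the abstract. `query_model` is a hypothetical stand-in for a
# real LLM API call.

def consistency_score(query_model, prompt_variants, reference_answer):
    """Fraction of reworded prompts for which the model still returns the
    reference answer (exact match after whitespace stripping)."""
    answers = [query_model(p).strip() for p in prompt_variants]
    return sum(a == reference_answer for a in answers) / len(prompt_variants)

if __name__ == "__main__":
    # Stubbed model: labels the document "Category B" only when the prompt
    # literally asks it to "classify" -- a toy model that fails on rewording.
    def query_model(prompt):
        return "Category B" if "classif" in prompt.lower() else "no answer"

    variants = [
        "Please classify this government document: ...",
        "classify the following document: ...",
        "What category does this document fall under? ...",
    ]
    print(consistency_score(query_model, variants, "Category B"))  # 0.666...
```

A real harness would substitute an actual LLM client for the stub and use a more forgiving match (e.g., normalized or judge-based scoring), but the shape of the check is the same: the same underlying question, several surface forms, one consistency score.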