ZJU | Apr 27, 2026 | arXiv:2604.24690

Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu, Yiming Zhang, Jia Zhou, Yanfei Zhang, Junbo Zhao

AI Summary

The paper introduces ProHist-Bench, a new benchmark designed to evaluate the historical reasoning capabilities of LLMs using questions derived from the Chinese Imperial Examination system. This benchmark features 400 expert-curated questions across eight dynasties, along with detailed evaluation rubrics, to assess higher-order skills like evidentiary reasoning. Evaluation of 18 LLMs on ProHist-Bench reveals a significant gap in their ability to handle complex historical research questions, highlighting the need for domain-specific reasoning advancements.

Key Contribution

LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.

Abstract

While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.

Eval Frameworks & Benchmarks | Natural Language Processing | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
