The paper introduces ProHist-Bench, a benchmark that evaluates the historical reasoning capabilities of LLMs using questions derived from the Chinese Imperial Examination (Keju) system. It comprises 400 expert-curated questions spanning eight dynasties, paired with 10,891 fine-grained evaluation rubrics, and targets higher-order skills such as evidentiary reasoning. An evaluation of 18 LLMs on ProHist-Bench reveals a significant gap in their ability to handle complex historical research questions, underscoring the need for advances in domain-specific reasoning.
LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.
While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.