TriBench-Ko is a new Korean benchmark for evaluating the deployment risks of LLMs in judicial workflows, covering tasks such as jurisprudence summarization and precedent retrieval. It assesses models against multiple risk categories, including inaccuracy, bias, inconsistency, and adjudicative overreach, using items built from real judicial decisions. Evaluations show that current LLMs struggle with precedent retrieval and often fail to capture critical legal information, which calls for caution when deploying them in judicial contexts.
LLMs applied to Korean judicial workflows are prone to hallucination, bias, and inconsistency, especially when retrieving precedents and summarizing jurisprudence.
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko
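The pairing of tasks and risk types described in the abstract can be pictured as a small data structure. The following is a minimal sketch only: the names `TASKS`, `RISK_TAXONOMY`, and `BenchItem` and the snake_case labels are hypothetical and are not taken from the TriBench-Ko repository, whose actual schema may differ. Only the four tasks and the risk categories with their listed subtypes come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional

# The four core tasks named in the abstract (label spelling is an assumption).
TASKS = (
    "jurisprudence_summarization",
    "precedent_retrieval",
    "legal_issue_extraction",
    "evidence_analysis",
)

# Risk categories and the subtypes listed in the abstract.
# Adjudicative overreach has no subtypes listed there, so it maps to an empty tuple.
RISK_TAXONOMY = {
    "inaccuracy": ("hallucination", "omission", "statutory_misapplication"),
    "bias": ("demographic", "overcompliance"),
    "inconsistency": ("prompt_sensitivity", "non_determinism"),
    "adjudicative_overreach": (),
}


@dataclass(frozen=True)
class BenchItem:
    """Hypothetical item: one task paired with one specific risk type."""

    task: str
    risk_category: str
    risk_type: Optional[str] = None  # None only for categories without subtypes

    def __post_init__(self) -> None:
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if self.risk_category not in RISK_TAXONOMY:
            raise ValueError(f"unknown risk category: {self.risk_category}")
        subtypes = RISK_TAXONOMY[self.risk_category]
        if subtypes and self.risk_type not in subtypes:
            raise ValueError(
                f"{self.risk_type!r} is not a subtype of {self.risk_category}"
            )
        if not subtypes and self.risk_type is not None:
            raise ValueError(f"{self.risk_category} has no subtypes")


# Example: a precedent-retrieval item that probes for hallucination.
example = BenchItem("precedent_retrieval", "inaccuracy", "hallucination")
```

Encoding the taxonomy once and validating each item against it is one way to keep every item tied to exactly one task and one risk type, which is the structure the abstract describes.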