This paper introduces AEPC-QA, a new benchmark for evaluating LLMs on Quebec insurance knowledge, consisting of 807 multiple-choice questions derived from regulatory certification handbooks. The authors benchmark 51 LLMs in both closed-book and RAG settings, finding that chain-of-thought reasoning is crucial, RAG can significantly boost weaker models but hurt stronger ones, and generalist models outperform domain-specific fine-tuned models. The results highlight the need for robustness calibration before deploying LLMs for automated advisory services in this high-stakes domain.
RAG can backfire spectacularly on strong LLMs in Quebec insurance QA, causing "context distraction" and performance regressions, even as it massively boosts weaker models.
The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.
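The comparison between the two paradigms comes down to per-model accuracy on the multiple-choice questions and the difference between the RAG and closed-book scores, expressed in percentage points. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the `Item` record, its field names, and the toy answers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question with a model's answers (hypothetical schema)."""
    question_id: str
    gold: str         # correct option letter
    closed_book: str  # model answer without retrieved context
    rag: str          # model answer with retrieved context


def accuracy(items, attr):
    """Percentage of items whose answer in `attr` matches the gold option."""
    if not items:
        raise ValueError("cannot compute accuracy on an empty item list")
    correct = sum(1 for it in items if getattr(it, attr) == it.gold)
    return 100.0 * correct / len(items)


def rag_delta_pp(items):
    """RAG accuracy minus closed-book accuracy, in percentage points."""
    return accuracy(items, "rag") - accuracy(items, "closed_book")


def regressions(items):
    """IDs of questions answered correctly closed-book but incorrectly with RAG."""
    return [
        it.question_id
        for it in items
        if it.closed_book == it.gold and it.rag != it.gold
    ]


# Toy data, invented for illustration only (not taken from AEPC-QA).
items = [
    Item("q1", gold="A", closed_book="A", rag="A"),
    Item("q2", gold="B", closed_book="C", rag="B"),
    Item("q3", gold="D", closed_book="D", rag="B"),
    Item("q4", gold="C", closed_book="A", rag="C"),
]

print(accuracy(items, "closed_book"))  # closed-book accuracy in percent
print(accuracy(items, "rag"))          # RAG accuracy in percent
print(rag_delta_pp(items))             # positive: RAG helped; negative: RAG hurt
print(regressions(items))              # questions where retrieved context flipped a correct answer
```

A positive delta corresponds to the "knowledge equalizer" effect described above, while a negative delta together with a long regression list is one way to surface what the abstract calls "context distraction". Tracking regressions per question, rather than only the aggregate delta, matters because a net gain can still hide individual answers that retrieval turned from right to wrong.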