This paper investigates whether LLMs can serve as automatic judges of semantic equivalence in French medical open-ended question answering (OEQA). It compares closed-access, general-purpose, and biomedical domain-adapted LLMs in that role. The results indicate that LLM-based judgments depend strongly on which model generated the answer being judged. Adapting a compact model with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) improves its performance and reduces this generator sensitivity.
LLM judges of French medical QA are strongly swayed by which model generated the answer, but lightweight fine-tuning (SFT and GRPO) of a compact model substantially reduces that sensitivity.
Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.
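As a rough illustration of what generator-aware evaluation could look like, the sketch below groups judge decisions by the model that generated each answer and reports agreement with expert labels per generator, plus the gap between the best and worst generator. Everything in it is an assumption made for illustration: the abstract does not specify the agreement metric, so simple match rate is used here, and the names `agreement_by_generator`, `generator_sensitivity`, `gen_a`, and `gen_b`, as well as the toy records, are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def agreement_by_generator(records):
    """Return the judge/expert match rate for each answer generator.

    `records` is an iterable of (generator, judge_label, expert_label)
    tuples. Simple match rate is an assumed metric, not the paper's.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for generator, judge_label, expert_label in records:
        totals[generator] += 1
        hits[generator] += int(judge_label == expert_label)
    return {g: hits[g] / totals[g] for g in totals}


def generator_sensitivity(per_generator):
    """Gap between the best and worst per-generator agreement (assumed proxy)."""
    values = list(per_generator.values())
    return max(values) - min(values)


# Made-up toy data: 1 = "semantically equivalent", 0 = "not equivalent".
records = [
    ("gen_a", 1, 1), ("gen_a", 0, 0), ("gen_a", 1, 1), ("gen_a", 0, 1),
    ("gen_b", 1, 0), ("gen_b", 0, 0), ("gen_b", 1, 0), ("gen_b", 1, 1),
]

per_gen = agreement_by_generator(records)
print(per_gen)
print(generator_sensitivity(per_gen))
```

Reporting agreement per generator rather than as one pooled number makes it visible when a judge tracks the experts well on one model's answers and poorly on another's. A chance-corrected statistic could replace match rate without changing the structure.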