CNRS | Mar 4, 2026 | arXiv:2603.04033

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre

AI Summary

This paper investigates whether LLMs can serve as automatic judges for French medical open-ended question answering (OEQA). The study compares closed-access, general-purpose, and biomedical domain-adapted LLMs on the task of judging semantic equivalence. Results indicate that LLM-based judgments are strongly influenced by which model generated the answer. Adapting a compact model with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) improves performance and reduces this generator sensitivity.

Key Contribution

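How well LLM judges of French medical open-ended QA agree with experts depends strongly on which model generated the answer. Lightweight adaptation of a compact judge with SFT and GRPO improves performance and reduces that sensitivity, even with limited data.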

Abstract

Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.
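The generator-aware evaluation described above can be illustrated with a minimal sketch that scores a judge separately for each answer generator instead of pooling all answers. Everything in it is an assumption made for illustration: the abstract does not say which agreement metric the authors use, so raw accuracy and Cohen's kappa stand in here, and the generator names (`gen_A`, `gen_B`), the record layout, and the toy labels are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def cohen_kappa(expert, judge):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(expert)
    labels = set(expert) | set(judge)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(e == j for e, j in zip(expert, judge)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    p_e = sum((expert.count(l) / n) * (judge.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1 - p_e)


def agreement_by_generator(records):
    """Group (generator, expert_label, judge_label) records by generator
    and report agreement for each group separately."""
    groups = defaultdict(lambda: ([], []))
    for generator, expert_label, judge_label in records:
        groups[generator][0].append(expert_label)
        groups[generator][1].append(judge_label)
    return {
        generator: {
            "n": len(expert),
            "accuracy": sum(e == j for e, j in zip(expert, judge)) / len(expert),
            "kappa": cohen_kappa(expert, judge),
        }
        for generator, (expert, judge) in groups.items()
    }


# Hypothetical toy data: 1 = "semantically equivalent to the reference", 0 = "not".
toy_records = [
    ("gen_A", 1, 1), ("gen_A", 1, 1), ("gen_A", 0, 0), ("gen_A", 0, 0),
    ("gen_B", 1, 1), ("gen_B", 1, 0), ("gen_B", 0, 1), ("gen_B", 0, 0),
]

if __name__ == "__main__":
    for generator, scores in agreement_by_generator(toy_records).items():
        print(generator, scores)
```

Reporting one score per generator, rather than a single pooled number, is what makes generator sensitivity visible: a judge can look acceptable on average while agreeing with experts far less on answers from one particular generator.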

Topics: Eval Frameworks & Benchmarks | Natural Language Processing | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
