This study benchmarks Claude Opus 4.5 as a dental clinical decision support system (CDSS) by evaluating its treatment recommendations against gold-standard treatments from dental case reports. BERTScore, computed with RoBERTa-large, was employed to measure the semantic similarity between AI-generated and expert-provided treatment plans across nine dental specialties. The results show strong semantic alignment (mean BERTScore F1 of 0.8199) and consistent performance across specialties, but also reveal a speed-accuracy trade-off.
Claude Opus 4.5 nails dental treatment recommendations with a mean BERTScore F1 of 0.8199, but faster answers come at the cost of accuracy.
The integration of Large Language Models (LLMs) into clinical decision support systems represents a significant advancement in healthcare informatics. This study presents a comprehensive evaluation framework for benchmarking LLM-generated dental treatment recommendations using BERTScore as the primary semantic similarity metric. We evaluated Claude Opus 4.5 as a Clinical Decision Support System (CDSS) across 116 dental case reports extracted from the Case Reports in Dentistry journal (2024-2025), spanning nine dental specialties. BERTScore was calculated using the RoBERTa-large model to measure semantic alignment between AI-generated treatment plans and gold-standard published treatments. Results demonstrated strong semantic alignment, with a mean BERTScore F1 of 0.8199 (SD = 0.0144; 95% confidence interval: 0.8172-0.8225), significantly exceeding the 0.80 threshold (t = 14.90, p < 0.001, d = 1.38). Cross-specialty analysis revealed consistent performance across all nine dental domains (Kruskal-Wallis H = 3.07, p = 0.879), indicating robust generalizability. A significant negative correlation was observed between BERTScore and response time (ρ = -0.371, p < 0.001), suggesting a speed-accuracy trade-off in LLM reasoning. This study contributes a reproducible benchmarking methodology for evaluating LLM performance in specialized clinical domains and demonstrates the potential of BERTScore as a scalable evaluation metric for AI-generated clinical text.
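For readers who want to see how the metric is typically computed, the sketch below uses the open-source `bert-score` package with a RoBERTa-large backbone, as the abstract describes. This is a minimal illustration, not the study's actual pipeline: the candidate and reference strings are invented placeholders, and the study's 116 real case reports are not reproduced here.

```python
# Minimal sketch: scoring an AI-generated treatment plan against a
# published gold-standard treatment with BERTScore (RoBERTa-large).
# Requires: pip install bert-score
from bert_score import score

# Illustrative placeholders -- the study used 116 case reports from
# Case Reports in Dentistry (2024-2025); these strings are invented.
candidates = [
    "Root canal therapy followed by a full-coverage crown on tooth 36.",
]
references = [
    "Endodontic treatment of tooth 36 with subsequent crown restoration.",
]

# model_type pins the backbone explicitly; lang="en" is used for
# tokenizer/baseline selection.
P, R, F1 = score(
    candidates,
    references,
    model_type="roberta-large",
    lang="en",
)

print(f"Precision: {P.mean().item():.4f}")
print(f"Recall:    {R.mean().item():.4f}")
print(f"F1:        {F1.mean().item():.4f}")  # the study reports mean F1 = 0.8199
```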
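The reported significance tests also map onto standard SciPy calls. The following is a hedged reconstruction of that analysis, assuming per-case F1 scores, specialty labels, and response times are available as arrays; the synthetic data below only mimics the reported summary statistics and is not the study's data.

```python
# Sketch of the abstract's statistical analysis using SciPy.
# All arrays are synthetic placeholders shaped like the study (n = 116).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f1 = rng.normal(0.8199, 0.0144, size=116)     # placeholder per-case BERTScore F1
response_time = rng.uniform(5, 60, size=116)  # placeholder response times (s)
specialty = rng.integers(0, 9, size=116)      # placeholder labels, 9 specialties

# One-sample t-test against the 0.80 threshold (reported: t = 14.90, p < 0.001)
t_stat, p_val = stats.ttest_1samp(f1, popmean=0.80)
cohens_d = (f1.mean() - 0.80) / f1.std(ddof=1)  # reported d = 1.38

# Kruskal-Wallis test across the nine specialties (reported: H = 3.07, p = 0.879)
groups = [f1[specialty == s] for s in range(9)]
h_stat, p_kw = stats.kruskal(*groups)

# Spearman correlation between F1 and response time (reported: rho = -0.371)
rho, p_rho = stats.spearmanr(f1, response_time)

print(f"t = {t_stat:.2f}, p = {p_val:.3g}, d = {cohens_d:.2f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.3f}")
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.3g}")
```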