This study benchmarks Claude Opus 4.5 as a dental clinical decision support system (CDSS) by evaluating its treatment recommendations against gold-standard treatments from dental case reports. BERTScore, computed with RoBERTa-large, was employed to measure the semantic similarity between AI-generated and expert-provided treatment plans across nine dental specialties. The results show strong semantic alignment (mean BERTScore F1 of 0.8199) and consistent performance across specialties, but also reveal a speed-accuracy trade-off.
Claude Opus 4.5 nails dental treatment recommendations with a mean BERTScore F1 of 0.8199, but faster answers come at the cost of accuracy.
The integration of Large Language Models (LLMs) into clinical decision support systems represents a significant advancement in healthcare informatics. This study presents a comprehensive evaluation framework for benchmarking LLM-generated dental treatment recommendations using BERTScore as the primary semantic similarity metric. We evaluated Claude Opus 4.5 as a Clinical Decision Support System (CDSS) across 116 dental case reports extracted from the Case Reports in Dentistry journal (2024-2025), spanning nine dental specialties. BERTScore was calculated using the RoBERTa-large model to measure semantic alignment between AI-generated treatment plans and gold-standard published treatments. Results demonstrated strong semantic alignment, with a mean BERTScore F1 of 0.8199 (SD = 0.0144; 95% confidence interval: 0.8172-0.8225), significantly exceeding the 0.80 threshold (t = 14.90, p < 0.001, d = 1.38). Cross-specialty analysis revealed consistent performance across all nine dental domains (Kruskal-Wallis H = 3.07, p = 0.879), indicating robust generalizability. A significant negative correlation was observed between BERTScore and response time (ρ = -0.371, p < 0.001), suggesting a speed-accuracy trade-off in LLM reasoning. This study contributes a reproducible benchmarking methodology for evaluating LLM performance in specialized clinical domains and demonstrates the potential of BERTScore as a scalable evaluation metric for AI-generated clinical text.
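For readers who want to see how the metric is typically computed, the sketch below uses the open-source `bert-score` package with a RoBERTa-large backbone, as the abstract describes. This is a minimal illustration, not the study's actual pipeline: the candidate and reference strings are invented placeholders, and the study's 116 real case reports are not reproduced here.

```python
# Minimal sketch: scoring an AI-generated treatment plan against a
# published gold-standard treatment with BERTScore (RoBERTa-large).
# Requires: pip install bert-score
from bert_score import score

# Illustrative placeholders -- the study used 116 case reports from
# Case Reports in Dentistry (2024-2025); these strings are invented.
candidates = [
    "Root canal therapy followed by a full-coverage crown on tooth 36.",
]
references = [
    "Endodontic treatment of tooth 36 with subsequent crown restoration.",
]

# model_type pins the backbone explicitly; lang="en" is used for
# tokenizer/baseline selection.
P, R, F1 = score(
    candidates,
    references,
    model_type="roberta-large",
    lang="en",
)

print(f"Precision: {P.mean().item():.4f}")
print(f"Recall:    {R.mean().item():.4f}")
print(f"F1:        {F1.mean().item():.4f}")  # the study reports mean F1 = 0.8199
```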
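The reported significance tests also map onto standard SciPy calls. The following is a hedged reconstruction of that analysis, assuming per-case F1 scores, specialty labels, and response times are available as arrays; the synthetic data below only mimics the reported summary statistics and is not the study's data.

```python
# Sketch of the abstract's statistical analysis using SciPy.
# All arrays are synthetic placeholders shaped like the study (n = 116).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f1 = rng.normal(0.8199, 0.0144, size=116)     # placeholder per-case BERTScore F1
response_time = rng.uniform(5, 60, size=116)  # placeholder response times (s)
specialty = rng.integers(0, 9, size=116)      # placeholder labels, 9 specialties

# One-sample t-test against the 0.80 threshold (reported: t = 14.90, p < 0.001)
t_stat, p_val = stats.ttest_1samp(f1, popmean=0.80)
cohens_d = (f1.mean() - 0.80) / f1.std(ddof=1)  # reported d = 1.38

# Kruskal-Wallis test across the nine specialties (reported: H = 3.07, p = 0.879)
groups = [f1[specialty == s] for s in range(9)]
h_stat, p_kw = stats.kruskal(*groups)

# Spearman correlation between F1 and response time (reported: rho = -0.371)
rho, p_rho = stats.spearmanr(f1, response_time)

print(f"t = {t_stat:.2f}, p = {p_val:.3g}, d = {cohens_d:.2f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.3f}")
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.3g}")
```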