NVIDIA, Caltech, Cedars-Sinai, USC | May 25, 2026 | arXiv:2605.25440

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

Rafal Kocielnik, J. Everett Knudsen, Steven Y. Cen, Jasmine Lin, Cherine H. Yang, Atharva Deo, Ujjwal Pasupulety, Peter Wager, Anima Anandkumar, Andrew J. Hung

AI Summary

This paper introduces a two-stage LLM framework to automatically assess the quality of surgical feedback by discovering interpretable criteria like clarity and urgency. Multi-agent prompting and surgical domain knowledge are used to generate these criteria, which are then used by an LLM-as-a-judge to score feedback instances. The framework outperforms content-based methods in predicting feedback effectiveness, as measured by trainee behavioral adjustments and trainer approval, on a dataset of 4.2k feedback instances.

Key Contribution

LLMs can discover interpretable and effective criteria for assessing communication quality, outperforming traditional content-based methods in predicting real-world impact.

Abstract

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition. Yet assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content through extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook qualitative aspects of feedback delivery such as clarity or urgency. The few existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two-stage LLM-based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi-agent prompting and surgical domain knowledge injection to discover a small set of human-interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM-as-a-judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI-discovered criteria outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human-aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.
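To make the second stage more concrete, the following is a minimal Python sketch of what scoring one feedback instance against a fixed set of criteria with an LLM-as-a-judge could look like. It is not the authors' implementation. The criterion names come from the abstract's examples, but the criterion descriptions, the 1 to 5 scale, the prompt wording, the JSON reply format, the sample feedback sentence, and every function name (`build_judge_prompt`, `parse_scores`, `score_feedback`, `stub_judge`) are assumptions made for illustration. The criteria-discovery stage is not sketched, and the judge is a stub that returns fixed scores so the example runs without any model.

```python
import json
from typing import Callable, Dict

# Criterion names are the examples given in the abstract; the one-line
# descriptions are illustrative assumptions, not the paper's definitions.
CRITERIA: Dict[str, str] = {
    "Encouraging": "The feedback supports or motivates the trainee.",
    "Urgent": "The feedback signals that immediate action is needed.",
    "Clear": "The feedback is specific and easy to act on.",
}


def build_judge_prompt(feedback: str, criteria: Dict[str, str],
                       scale_max: int = 5) -> str:
    """Assemble a judge prompt (hypothetical wording and rating scale)."""
    listing = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
    return (
        f"Rate the surgical feedback below on each criterion from 1 to {scale_max}.\n"
        f"Criteria:\n{listing}\n"
        f"Feedback: {feedback!r}\n"
        "Reply with a JSON object mapping each criterion name to an integer."
    )


def parse_scores(reply: str, criteria: Dict[str, str],
                 scale_max: int = 5) -> Dict[str, int]:
    """Parse the judge's JSON reply and check every criterion is in range."""
    raw = json.loads(reply)
    scores: Dict[str, int] = {}
    for name in criteria:
        value = int(raw[name])  # raises KeyError if a criterion is missing
        if not 1 <= value <= scale_max:
            raise ValueError(f"{name} score {value} outside 1..{scale_max}")
        scores[name] = value
    return scores


def score_feedback(feedback: str, judge: Callable[[str], str],
                   criteria: Dict[str, str] = CRITERIA,
                   scale_max: int = 5) -> Dict[str, int]:
    """Score one feedback instance; `judge` maps a prompt to a text reply."""
    prompt = build_judge_prompt(feedback, criteria, scale_max)
    return parse_scores(judge(prompt), criteria, scale_max)


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; returns fixed scores so this runs offline."""
    return json.dumps({"Encouraging": 2, "Urgent": 4, "Clear": 5})


if __name__ == "__main__":
    # The feedback sentence is a made-up example, not taken from the dataset.
    print(score_feedback("Stop, pull the needle back.", stub_judge))
```

Passing the judge in as a callable keeps the prompt construction and reply validation independent of any particular model or client library. The resulting per-criterion scores are the kind of features that could then be fed to a downstream model predicting feedback effectiveness, which this sketch does not attempt.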

Eval Frameworks & Benchmarks, Natural Language Processing, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
