The study compares the quality and accuracy of portfolio feedback generated by GPT-4o and Claude Sonnet 4 (via Amazon Bedrock) in the context of Qpercom's digital assessment tools for high-stakes clinical assessments. It analyzes both preview feedback (for examiners) and feedback delivered directly to students, evaluating how well each model identifies different levels of student performance. The findings assess the safety, constructiveness, and educational value of the AI-generated feedback.
AI-generated feedback on student portfolios from GPT-4o and Claude Sonnet 4 shows promise for high-stakes clinical assessments, but careful evaluation is needed to ensure accuracy and educational value.
This report provides an in-depth comparative analysis of AI-generated portfolio feedback delivered through two leading large language model (LLM) platforms: GPT-4o (OpenAI) and Claude Sonnet 4 (Anthropic, via Amazon Bedrock). The feedback was analyzed in two distinct stages: preview feedback, which serves as a safety and verification layer for examiners and administrators, and portfolio feedback, which is delivered directly to students. These systems are integral to Qpercom's digital assessment tools and support high-stakes clinical assessments such as Objective Structured Clinical Examinations (OSCEs), high-stakes recruitment using Multiple Mini Interviews (MMIs), and Video Interviewing and Digital Scoring (VIDS). This evaluation examines how accurately each model reflects students' actual high-, mid-, and underperformance, and whether each model's feedback provides safe, constructive, and educationally valuable input.