This paper evaluates the reliability and validity of VERA-MH, an open-source AI safety benchmark for mental health applications, specifically in suicide risk detection and response. The study involved simulating conversations between LLM-based user-agents and general-purpose AI chatbots, which were then independently rated for safety by licensed clinicians and an LLM-based judge using a standardized rubric. The results demonstrate strong inter-rater reliability among clinicians (IRR = 0.77) and high alignment between clinician consensus and the LLM judge (IRR = 0.81), supporting the validity and reliability of VERA-MH as a safety evaluation tool.
An open-source AI safety benchmark for mental health, VERA-MH, demonstrates strong alignment between clinician and LLM-based evaluations, suggesting a path toward automated safety assessments.
Millions now use generative AI chatbots for psychological support. Despite their promise of availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then (a) examined rating alignment among individual clinicians, (b) examined alignment between clinician consensus and the LLM judge, and (c) summarized clinicians' ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR] = 0.77), establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus overall (IRR = 0.81) and within key conditions. Together, findings from this human evaluation study support the validity and reliability of VERA-MH, an open-source, automated AI safety evaluation for mental health. Future research will examine the generalizability and robustness of VERA-MH and expand the framework to target additional key areas of AI safety in mental health.
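The abstract describes a two-stage pipeline: simulate conversations between an LLM-based user-agent and the chatbot under test, then score each transcript against a safety rubric. The sketch below illustrates that loop only; it is not the actual VERA-MH implementation, and `chat()`, the prompts, and the persona string are all hypothetical placeholders standing in for a real LLM API and the real rubric.

```python
# Illustrative sketch of the simulate-then-judge loop described in the
# abstract. Not the actual VERA-MH code: `chat()` is a hypothetical stand-in
# for any LLM completion API, and the prompts are placeholders.

def chat(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical LLM call; wire this to a real completion API."""
    raise NotImplementedError

def simulate_conversation(persona: str, turns: int = 10) -> list[dict]:
    """Alternate a persona-driven user-agent with the chatbot under test."""
    history: list[dict] = []
    for _ in range(turns):
        # The user-agent role-plays a help-seeking user from a persona spec.
        user_msg = chat(f"Role-play this help-seeking user: {persona}", history)
        history.append({"role": "user", "content": user_msg})
        # The general-purpose chatbot being evaluated responds in turn.
        bot_msg = chat("You are a general-purpose assistant.", history)
        history.append({"role": "assistant", "content": bot_msg})
    return history

def judge(transcript: list[dict], rubric: str) -> str:
    """Ask an LLM judge to score a transcript against the safety rubric."""
    rendered = "\n".join(f"{m['role']}: {m['content']}" for m in transcript)
    return chat(f"Score this conversation using the rubric:\n{rubric}",
                [{"role": "user", "content": rendered}])
```

In the study, the same transcripts and the same rubric go to both the licensed clinicians and the LLM judge, which is what makes the two sets of ratings directly comparable.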
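The abstract reports chance-corrected inter-rater reliability without specifying the statistic in this excerpt. As a minimal sketch, assuming a two-rater, categorical-label setting, Cohen's kappa is one standard chance-corrected agreement measure; the ratings below are made up for illustration.

```python
# Minimal sketch of chance-corrected agreement (Cohen's kappa) between two
# raters assigning "safe"/"unsafe" labels to the same conversations.
# Illustrative only: the paper's exact IRR statistic is not stated in this
# excerpt, and the ratings below are fabricated for the example.

from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of ten simulated conversations.
clinician = ["safe", "unsafe", "safe", "safe", "unsafe",
             "safe", "safe", "unsafe", "safe", "safe"]
llm_judge = ["safe", "unsafe", "safe", "unsafe", "unsafe",
             "safe", "safe", "unsafe", "safe", "safe"]

print(f"kappa = {cohens_kappa(clinician, llm_judge):.2f}")  # kappa = 0.78
```

With more than two clinicians, a multi-rater statistic such as Fleiss' kappa or Gwet's AC1 would be the analogous choice; the chance-correction idea is the same.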