This paper investigates the impact of LLM-based chatbot suggestions on the accuracy of nonprofit caseworkers providing guidance on social service programs like SNAP. A 770-question benchmark dataset was created to evaluate caseworker performance with varying levels of chatbot accuracy, ranging from 53% to 100%. Results show that high-quality chatbots (96-100% accurate) significantly improve caseworker accuracy by 27 percentage points, but incorrect chatbot suggestions substantially reduce caseworker accuracy, especially on easier questions, and improvements plateau as chatbot accuracy increases.
Beware the "AI underreliance plateau": even highly accurate LLM chatbots can only improve human caseworker accuracy so much, and incorrect suggestions can tank performance on easy questions.
Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance improved significantly as chatbot quality improved: high-quality chatbots (96-100% accurate) raised caseworker accuracy by 27 percentage points. At the question level, incorrect chatbot suggestions substantially reduced caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy leveled off as chatbot accuracy increased, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.