Tsinghua AI | Independent Researcher | Feb 26, 2026 | arXiv:2602.22775

TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

Joydeep Chandra, Satyam Kumar Navneet

AI Summary

The paper introduces TherapyProbe, a methodology leveraging adversarial multi-agent simulation to identify relational safety failures in mental health chatbots, focusing on interaction patterns across conversations. TherapyProbe uses open-source models to generate conversation trajectories and surfaces failures like validation spirals and empathy fatigue. The authors translate these failures into a Safety Pattern Library of 23 failure archetypes, providing design recommendations for developers, clinicians, and policymakers.

Key Contribution

Uncovered: mental health chatbots can fall into dangerous "validation spirals" or "empathy fatigue" patterns, revealing critical relational safety flaws missed by current single-turn evaluations.

Abstract

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, that is, the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures: interaction patterns like "validation spirals," where chatbots progressively reinforce hopelessness, or "empathy fatigue," where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 51

Year: 2026

Venue: N/A
