This paper introduces a multi-stage framework for depression detection that uses LLMs to generate progressively refined clinical summaries from multimodal data (text, audio, video). These summaries guide a fusion module, improving both accuracy and interpretability across binary screening, five-class severity classification, and continuous regression. Experiments on the E-DAIC and CMDC datasets show improvements over state-of-the-art baselines in both accuracy and interpretability.
LLMs can generate clinical summaries that not only improve the accuracy of multimodal depression detection but also provide transparent rationales for those predictions.
Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.
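The coarse-to-fine flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fusion mechanism, prompts, or encoders, so every name here (`toy_embed`, `summary_guided_fusion`, `run_pipeline`, `stub_llm`) is hypothetical. The sketch assumes, purely for illustration, that each stage's summary is embedded and used to attention-weight the three modality feature vectors; a real system would replace the stub LLM and toy encoder with actual models and add a prediction head per stage.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

# The three stages named in the abstract, ordered coarse to fine.
STAGES = ["binary screening", "five-class severity", "continuous regression"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_embed(text: str, dim: int = 4) -> Vector:
    """Hypothetical stand-in for a text encoder (assumption, not from the paper)."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def summary_guided_fusion(
    summary_vec: Vector, feats: Dict[str, Vector]
) -> Tuple[Vector, Dict[str, float]]:
    """Weight each modality by its dot-product similarity to the summary.

    This attention-style weighting is an assumed mechanism for illustration.
    """
    names = sorted(feats)
    scores = [sum(a * b for a, b in zip(summary_vec, feats[n])) for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * feats[n][d] for w, n in zip(weights, names))
        for d in range(len(summary_vec))
    ]
    return fused, dict(zip(names, weights))


def run_pipeline(
    transcript: str,
    feats: Dict[str, Vector],
    llm: Callable[[str, str, List[str]], str],
) -> Tuple[List[dict], str]:
    """Run the stages in order; each summary sees the earlier summaries."""
    summaries: List[str] = []
    results: List[dict] = []
    for stage in STAGES:
        summary = llm(stage, transcript, list(summaries))
        summaries.append(summary)
        fused, weights = summary_guided_fusion(toy_embed(summary), feats)
        results.append(
            {"stage": stage, "summary": summary, "fused": fused, "weights": weights}
        )
    # Consolidate all stage summaries into one human-readable report.
    report = "\n".join(f"[{r['stage']}] {r['summary']}" for r in results)
    return results, report


def stub_llm(stage: str, transcript: str, previous: List[str]) -> str:
    """Placeholder for a real LLM call; returns a deterministic string."""
    n_words = len(transcript.split())
    return f"{stage} summary #{len(previous) + 1} of a {n_words}-word transcript"
```

Calling `run_pipeline("I have felt tired lately", feats, stub_llm)` with three equal-length feature vectors keyed `"text"`, `"audio"`, and `"video"` returns one result per stage plus the consolidated report. The per-stage modality weights give one simple view of which modality each summary emphasized, which is the kind of rationale the abstract refers to.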